I am trying to read the body of a web service call with the requests module in Python 2.7. The body of the result is CSV data.
The request is just a GET. I can run the url in Chrome and it returns a complete response of 200 rows. The url looks like:
http://localhost:8080/summary-report?endTime=9999999&startTime=0
When I try to get the data from Python, it retrieves only the first row of data. I look at the response header and see the length of the response is only 122 bytes, about the length of the first row. I have run it a number of times from Python and it is consistent in returning only the first row.
The code is only:
r = requests.get(url)
ans = r.content
Could it be something to do with the newline characters on each line of the CSV file?
Or because I am using localhost in the url?
I also tried it with 127.0.0.1 but saw the same behavior.
Possibly the ampersand is an issue in the URL?
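If it helps, here is a small diagnostic sketch I can run (only the same url variable as above is assumed); it prints the status and headers and then streams the body line by line:
import requests

r = requests.get(url, stream=True)
print r.status_code
print r.headers                     # check Content-Length / Transfer-Encoding here
for line in r.iter_lines():
    print line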
You can't get dynamically generated content with requests.get() alone.
You need a JavaScript engine to parse and run the JavaScript code inside the page. Some options are listed below:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
Related
I just started learning python yesterday and have VERY minimal coding skill. I am trying to write a python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I'm having consistent "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is coming in because the web links are mostly "s3.amazonaws.com" links that are SUPER long.
For reference:
Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG
Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG
Additionally, feel free to weigh in if I'm going about this in a stupid way. Each PDF starts with a string of 6 digits, and once I download the supplemental documents I want to auto-save and name them as XXXXXX_attachY.*, where X is the identifying string of digits and Y increases for each attachment. I haven't gotten my code to work well enough to test that, but I'm fairly certain I don't have it right either.
Help!
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse
## Accessing and Creating Six Digit File Code
pdf_dir = "/users/USERNAME/desktop/worky"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = (1)
    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x, safe='://')
            print(parsed_url)
            wget.download(parsed_url, '/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            ##os.rename(r'/users/USERNAME/desktop/worky/(filename).*',r'/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            attachment_counter += 1
    for x in url_list["pdf"]:
        print(parsed_url + "\n")
I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried it quickly with wget and got the same error (it might be linked to the User-Agent HTTP header that wget sends).
wget and HTTP header issues: download image from url using python urllib but receiving HTTP Error 403: Forbidden
HTTP headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
import requests
r = requests.get("https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG")
with open("myfile.png", "wb") as file:
    file.write(r.content)
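For instance, a minimal sketch of adding a custom User-Agent header (the header value and the shortened URL below are only placeholders):
import requests

# hypothetical example: send a browser-like User-Agent header; the exact value is just an illustration
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}
url = "https://s3.amazonaws.com/os_uploads/..."  # same kind of URL as above, shortened here
r = requests.get(url, headers=headers)
with open("myfile.png", "wb") as file:
    file.write(r.content)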
I'm not sure I understand what you're trying to do, but maybe you want to use formatted strings to build your URLs (https://docs.python.org/3/library/stdtypes.html?highlight=format#str.format) ?
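For example, something like this rough sketch, reusing your newname, attachment_counter and parsed_url variables (the .png extension is only a guess for illustration):
import requests

# hypothetical: build the output path from your own variables instead of a literal '(newname)' string
target = "/users/USERNAME/desktop/worky/{0}_attach{1}.png".format(newname, attachment_counter)
with open(target, "wb") as out:
    out.write(requests.get(parsed_url).content)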
Maybe checking string indexes is fine in your case (if x[0:4] == "http":), but I think you should look at the Python re package and use regular expressions to catch the elements you want in a document (https://docs.python.org/3/library/re.html).
import re
regex = re.compile(r"^http://")
if re.match(regex, mydocument):
    <do something>
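For instance, a small sketch of pulling http/https links out of a block of text with re.findall (the pattern and sample text are deliberately simple, only an illustration):
import re

text = "see https://example.com/a.pdf and http://example.org/b.png"
links = re.findall(r"https?://\S+", text)
print(links)  # ['https://example.com/a.pdf', 'http://example.org/b.png']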
The reason for this behavior is inside the wget library. Internally it encodes the URL with urllib.parse.quote() (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote).
Basically it replaces characters with their appropriate %xx escape sequences. Your URL is already escaped, but the library does not know that. When it parses the %20 it sees % as a character that needs to be replaced, so the result is %2520 and a different URL, hence the 403 error.
You could decode that URL first and then pass it, but then you would have another problem with this library, because your URL has the parameter filename*= while the library expects filename=.
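To illustrate the double-encoding, a quick sketch with urllib.parse (using a shortened, made-up URL):
from urllib.parse import quote, unquote

already_escaped = "https://s3.amazonaws.com/os_uploads/DFA%20train%20pass.PNG"  # shortened example
print(quote(already_escaped, safe='://'))  # the %20 becomes %2520, i.e. a different URL
print(unquote(already_escaped))            # decodes %20 back to a space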
I would recommend doing something like this:
import requests

# get the file
req = requests.get(parsed_url)
# parse your URL to get the GET parameters
get_parameters = [x for x in parsed_url.split('?')[1].split('&')]
filename = ''
# find the GET parameter with the name
for get_parameter in get_parameters:
    if "filename*=" in get_parameter:
        # split it to get the name
        filename = get_parameter.split('filename*=')[1]
# save the file
with open(<path> + filename, 'wb') as file:
    file.write(req.content)
I would also recommend removing the utf-8'' in that filename because I don't think it is actually part of the filename. You could also use regular expressions for getting the filename, but this was easier for me.
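If you do go the regular-expression route, a rough sketch might look like this (the pattern and the example value are only illustrations based on the shape of your URL):
import re
from urllib.parse import unquote

param = "filename*=utf-8''DFA%2520train%2520pass.PNG"  # hypothetical example value
match = re.search(r"filename\*?=(?:utf-8'')?(.+)", param)
if match:
    filename = unquote(unquote(match.group(1)))  # %2520 -> %20 -> space
    print(filename)  # DFA train pass.PNG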
I'm writing a python script to parse jenkins job results. I'm using urllib2 to fetch consoleText, but the file that I receive isn't full. The code to fetch the file is:
import urllib2

data = urllib2.urlopen('http://<server>/job/<jobname>/<buildid>/consoleText')
lines = data.readlines()
And the number of lines I get is 2306, while the actual number of lines in the console log is 37521. I can check that by fetching the file via wget:
$ wget 'http://<server>/job/<jobname>/<buildid>/consoleText'
$ wc -l consoleText
37521
Why does urlopen not give me the full result?
UPDATE:
Using requests (as suggested by @svrist) instead of urllib2 doesn't have this problem, so I'm switching to it. My new code is:
data = requests.get('http://<server>/job/<jobname>/<buildid>/consoleText')
lines = [l for l in data.iter_lines()]
But I still have no idea why urllib2.urlopen doesn't work properly.
The Jenkins build log is returned using a chunked encoding response.
Transfer-Encoding: chunked
Based on a couple of other questions, it seems that urllib2 does not handle the entire response and, as you've observed, only returns the first chunk.
I also recommend using the requests package.
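For what it's worth, a quick sketch with requests that also confirms the chunked encoding (reusing the placeholder URL from the question):
import requests

r = requests.get('http://<server>/job/<jobname>/<buildid>/consoleText', stream=True)
print(r.headers.get('Transfer-Encoding'))  # expect 'chunked' for the build log
lines = list(r.iter_lines())
print(len(lines))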
I'm quite new to using Python for HTTP requests, and so far I have a script
that fetches XML files from a long list of URLs (all on the same server) and then extracts data from nodes with lxml.
Everything works fine, however I'm a bit concerned about the huge number of requests the host server might receive from me.
Is there a way, using requests, to send only one request to the server that will fetch all the XML from the different URLs and store them in a tar.gz file?
Here is what my script does so far (with a small sample):
import urllib2
import lxml.etree
from lxml import etree

# accession_clean is built earlier in the script
IDlist = list(accession_clean)
URLlist = ['http://www.uniprot.org/uniprot/Q13111.xml', 'http://www.uniprot.org/uniprot/A2A2F0.xml', 'http://www.uniprot.org/uniprot/G5EA09.xml', 'http://www.uniprot.org/uniprot/Q8IY37.xml', 'http://www.uniprot.org/uniprot/O14545.xml', 'http://www.uniprot.org/uniprot/O00308.xml', 'http://www.uniprot.org/uniprot/Q13136.xml', 'http://www.uniprot.org/uniprot/Q86UT6.xml']
for id, item in zip(IDlist, URLlist):
    try:
        textfile = urllib2.urlopen(item)
    except urllib2.HTTPError:
        print("URL", item, "could not be read.")
        continue
    try:
        tree = etree.parse(textfile)
    except lxml.etree.XMLSyntaxError:
        print 'Skipping invalid XML from URL {}'.format(item)
        continue
That website offers an API, which is documented here. Although it will not give you a tar.gz file, it is possible to retrieve multiple entries with a single HTTP request.
Perhaps one of the batch or query methods will work for you.
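If the end goal is simply to have everything in one tar.gz on your side, here is a rough sketch (using requests plus the standard tarfile and io modules, with the sample URLs from your question); note that this does not reduce the number of HTTP requests, which is what the batch/query methods above are for:
import io
import tarfile
import requests

# sketch only: the URL list is a subset of the sample from the question
URLlist = ['http://www.uniprot.org/uniprot/Q13111.xml',
           'http://www.uniprot.org/uniprot/A2A2F0.xml']

with tarfile.open('uniprot_entries.tar.gz', 'w:gz') as archive:
    for item in URLlist:
        resp = requests.get(item)
        if resp.status_code != 200:
            print('URL {} could not be read.'.format(item))
            continue
        data = resp.content
        info = tarfile.TarInfo(name=item.rsplit('/', 1)[-1])  # e.g. Q13111.xml
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))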
Sorry if this is redundant, but I've tried very hard to find an answer to this and have been unable to. I'm very new to this, so please bear with me:
My objective is for a piece of code to read through a csv full of urls, and return a http status code. I have Python 2.7.5. The outcome of each row would give me the url and the status code, something like this: www.stackoverflow.com: 200.
My csv is a single-column csv full of hundreds of urls, one per row. The code I am using is below, and when I run it, I get a \r separating two urls, similar to this:
{http://www.stackoverflow.com/test\rhttp://www.stackoverflow.com/questions/': 404}
What I would like to see is the two urls separated, and each with their own http status code:
{'http://www.stackoverflow.com': 200, 'http://www.stackoverflow.com/questions/': 404}
But there seems to be an extra \r when Python reads the csv, so it doesn't read the urls correctly. I know folks have said that strip() isn't an all-inclusive wiper-outer, so any advice on what to do to make this work would be very much appreciated.
import requests

def get_url_status(url):
    try:
        r = requests.head(url)
        return url, r.status_code
    except requests.ConnectionError:
        print "failed to connect"
        return url, 'error'

results = {}
with open('url2.csv', 'rb') as infile:
    for url in infile:
        url = url.strip()  # "http://datafox.co"
        url_status = get_url_status(url)
        results[url_status[0]] = url_status[1]

print results
You probably need to figure out how your csv file is formatted, before feeding it to Python.
First, make sure it has consistent line endings. If it sometimes has bare newlines and sometimes CRLF, that's probably a problem that needs to be corrected.
If you're on a *ix system, tr might be useful.
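If the culprit turns out to be lone \r (old Mac-style) line endings, one low-effort option in Python 2 is to open the file in universal-newline mode so every ending is normalized before you iterate. A small sketch reusing the get_url_status function from your code:
results = {}
with open('url2.csv', 'rU') as infile:  # 'rU' = universal-newline mode in Python 2
    for url in infile:
        url = url.strip()
        if not url:
            continue
        url_status = get_url_status(url)
        results[url_status[0]] = url_status[1]

print results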
I have a program where I need to open many webpages and download information from them. The information, however, is in the middle of each page, and it takes a long time to get to it. Is there a way to have urllib retrieve only x lines? Or, if nothing else, not load the information that comes after it?
I'm using Python 2.7.1 on Mac OS 10.8.2.
The returned object is a file-like object, and you can use .readline() to only read a partial response:
import urllib

resp = urllib.urlopen(url)
for i in range(10):
    line = resp.readline()
would read only 10 lines, for example. Note that this won't guarantee a faster response.
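A related option, if you would rather cap by size than by line count, is to read a fixed number of bytes (4096 here is an arbitrary choice):
import urllib

resp = urllib.urlopen(url)
chunk = resp.read(4096)     # read roughly the first 4 KB instead of the whole page
lines = chunk.splitlines()  # split whatever was fetched into lines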