So I am trying to parse a JSON file of Magic: The Gathering image URLs and download the images.
I wrote the script below and was banging my head against the wall because each request was taking upwards of 3 or 4 minutes to download. At first I thought the site was throttling me, but that wasn't the case.
I ended up switching from my cmd shell to a Git Bash shell, ran the script, and it worked as intended, so I believed I had solved it. But now the code runs slowly even in Git Bash, and the only thing that changed is the set I'm looking for. I tried disabling output buffering with '-u', but that doesn't help.
The "done" never gets printed, even though I know it's loading all the JSON.
If I put a print statement in the for loop before my if check, "done" does get printed. It gets as far as printing the first filename, but that's it.
import json
import urllib
import time

with open('scryfall-oracle-cards.json') as f:
    data = json.load(f)
print("done")

count = 0
for x in data:
    if x['set'] == "dom":
        cropUrl = x["image_uris"]["art_crop"]
        cardname = x['name'].replace(' ', '_') + "__" + x['set']
        fileName_crop = cardname + "_crop.jpg"
        print(fileName_crop)
        time.sleep(.1)
        urllib.urlretrieve(cropUrl, fileName_crop)
A great, easy, and widely adopted library for HTTP requests is requests.
Read the requests documentation.
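For example, here is a minimal sketch of the download loop using requests, assuming the same scryfall-oracle-cards.json structure as in the question. Reusing a Session keeps the connection alive between downloads, which often helps with exactly this kind of per-request slowness:

import json
import requests

# Minimal sketch, assuming the question's JSON structure. A Session reuses
# the underlying connection instead of reconnecting for every image.
session = requests.Session()

with open('scryfall-oracle-cards.json') as f:
    data = json.load(f)

for card in data:
    if card['set'] == "dom":
        crop_url = card["image_uris"]["art_crop"]
        file_name = card['name'].replace(' ', '_') + "__" + card['set'] + "_crop.jpg"
        response = session.get(crop_url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        with open(file_name, 'wb') as out:
            out.write(response.content)
        print(file_name)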
I am trying to process a relatively large (about 100k lines) CSV file in Python. This is what my code looks like:
#!/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os

csvFileName = sys.argv[1]
with open(csvFileName, 'r') as inputFile:
    parsedFile = csv.DictReader(inputFile, delimiter=',')
    totalCount = 0
    for row in parsedFile:
        target = row['new']
        source = row['old']
        systemLine = "some_curl_command {source}, {target}".format(source=source, target=target)
        os.system(systemLine)
        totalCount += 1
        print "\nProcessed number: " + str(totalCount)
I'm not sure how to optimize this script. Should I use something besides DictReader?
I have to use Python 2.7, and cannot upgrade to Python 3.
If you want to avoid multiprocessing, you can split your long CSV file into a few smaller CSVs and run them simultaneously, like this:
$ python your_script.py 1.csv &
$ python your_script.py 2.csv &
The ampersand stands for background execution in Linux environments. More details here. I don't know of an exact equivalent on Windows, but you can always open a few cmd windows.
Anyway, it's much better to stick with multiprocessing, of course.
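For instance, here is a minimal multiprocessing sketch (the pool size of 8 is an arbitrary choice, and 'old', 'new', and some_curl_command are the placeholders from the question):

import csv
import os
import sys
from multiprocessing import Pool

def process_row(row):
    # 'old' and 'new' are the column names from the question;
    # some_curl_command is its placeholder for the real command.
    os.system("some_curl_command {0}, {1}".format(row['old'], row['new']))

if __name__ == '__main__':
    with open(sys.argv[1], 'r') as inputFile:
        rows = list(csv.DictReader(inputFile, delimiter=','))
    pool = Pool(8)  # 8 worker processes; tune to your machine
    pool.map(process_row, rows)
    pool.close()
    pool.join()

The __main__ guard matters on Windows, where multiprocessing re-imports the script in each worker process.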
What about using requests instead of curl?
import requests

response = requests.get(source_url)
html = response.content
with open(target, "wb") as f:  # response.content is bytes, so write in binary mode
    f.write(html)
Here's the doc.
Avoid print statements; over a long run they are surprisingly slow. They're fine for development and debugging, but for the final run of your script you can remove them and check the count of processed files directly in the target folder.
running
subprocess.Popen(systemLine)
instead of
os.system(systemLine)
should speed things up, because Popen returns immediately instead of waiting for each command to finish, so the downloads run concurrently. Please note that systemLine has to be a list of strings, e.g. ['some_curl_command', 'source', 'target'], in order to work. If you want to limit the number of concurrent commands, have a look at that.
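As a rough sketch of one way to cap the number of concurrent processes (the limit of 10 and the polling interval are arbitrary choices):

import subprocess
import time

MAX_CONCURRENT = 10  # arbitrary cap
running = []

def launch(system_line):
    # system_line is a list of strings, e.g. ['some_curl_command', source, target]
    while len(running) >= MAX_CONCURRENT:
        # poll() returns None while a process is still running
        running[:] = [p for p in running if p.poll() is None]
        time.sleep(0.1)
    running.append(subprocess.Popen(system_line))

After the loop over the CSV, wait for the stragglers with: for p in running: p.wait()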
I'm automating some tedious shell tasks, mostly file conversions, in a kind of blunt force way with os.system calls (Python 2.7). For some bizarre reason, however, my running interpreter doesn't seem to be able to find the files that I just created.
Example code:
import os, time, glob
# call a node script to template a word document
os.system('node wordcv.js')
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# move to the directory that pdfwriter prints to
os.chdir('/users/shared/PDFwriter/pauliglot')
print glob.glob('*.pdf')
I expect to have a length 1 list with the resulting filename, instead I get an empty list.
The same occurs with
pdfs = [file for file in os.listdir('/users/shared/PDFwriter/pauliglot') if file.endswith(".pdf")]
print pdfs
I've checked by hand, and the expected files are actually where they're supposed to be.
Also, I was under the impression that os.system blocked, but just in case it doesn't, I also stuck a time.sleep(1) in there before looking for the files. (That's more than enough time for the other tasks to finish.) Still nothing.
Hmm. Help? Thanks!
You should add a wait after the call to launch. Launch will spawn the task in the background and return before the document is finished printing. You can either put in some arbitrary sleep statements or if you want you can also check for file existence if you know what the expected filename will be.
import time
# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# give word about 30 seconds to finish printing the document
time.sleep(30)
Alternative:
import os
import time

# print the resulting document to pdf
os.system('launch -p gowdercv.docx')
# wait for a maximum of 90 seconds
for x in xrange(0, 90):
    time.sleep(1)
    if os.path.exists('/path/to/expected/filename'):
        break
Reference for potentially needing a wait longer than 1 second here
I was having difficulty converting a program I made into a CGI script. I suspected it had to do with os.walk, so I made a smaller test script to check.
(I noticed the single \ before the D in the variable loc and tried changing it to a double \; still no change.)
It produces no errors, and I can't tell why the for loop over os.walk doesn't run in the browser.
I tried adding some data to s by hand and printing its contents in a for loop, and that worked fine, but with os.walk I can't get it to work. I can't find anything related to the issue on Google or Stack Overflow.
Below is the code:
import cgi, cgitb, os

loc = "C:\\Users\\wen\Desktop\\sample data\\old py stuff\\"
cgitb.enable(display=1, logdir=loc)
s = []

print("Content-type:text/html\r\n\r\n")
print("<html>")
print("<body>")
print("<p>"+loc+"</p>")
for r, ds, fs in os.walk(loc):
    print("<p>omgwtf</p>")
    for f in fs:
        s.append(f)
for i in s:
    print("<p>"+i+"</p>")
print("</body>")
print("</html>")
Here's a screenshot, with the interpreter output on the left and the browser output on the right: i.imgur.com/136y1Yq.jpg
The web server is running IIS 7.
I'm pretty sure I've solved the problem: I needed to give the folders permissions for 'Authenticated Users'.
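If you hit the same symptom, note that os.walk swallows directory-listing errors by default (its onerror argument defaults to None), which is why the loop silently produces nothing. A quick diagnostic sketch (not part of the original fix) is to list the directory directly, so the permission error becomes visible in the page:

import os

# Diagnostic sketch: IIS runs CGI scripts under its own account, so a
# directory that is readable in your interactive interpreter can still be
# off-limits here. os.listdir raises where os.walk stays silent.
loc = "C:\\Users\\wen\\Desktop\\sample data\\old py stuff\\"
try:
    entries = os.listdir(loc)
    print("<p>listdir OK, {0} entries</p>".format(len(entries)))
except OSError as e:
    print("<p>listdir failed: {0}</p>".format(e))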
I'm trying to automate the download of Argos data using Python's telnetlib, but I can't seem to figure out how to get it to download all of the output. Part of my problem may be that I don't really understand the seemingly asynchronous nature of the commands.
Here's the code:
tn = telnetlib.Telnet(host=HOST, timeout=60)
with open("argos_prv_{0}-1.txt".format(now_str), 'w') as of:
    tn.read_until("Username: ")
    tn.write(user + "\n")
    tn.read_until("Password: ")
    tn.write(password + "\n")
    tn.read_until("/")
    # Here's the command I'm trying to get the results of:
    tn.write("prv,,ds,{0:d},009919,009920\n".format(start_doy))
    # At this point, it's presumably dumped it all
    tn.read_until("ARGOS READY")
    tn.read_until("/")
    # Logging out
    tn.write("lo\n")
    lines = tn.read_all()
    of.write(lines)
    of.flush()
The code seems to run just fine, but when I look at the output file, it never has everything in it, cutting out at some random point. When I type the same commands in a real telnet session, it works just fine.
I get the sense it has something to do with trying to read_all() after logging out (tn.write("lo\n")), but when I look at the example documentation for telnetlib, it pretty much looks just like this.
Anyway, my question is: can anyone see what I'm doing wrong here? I want to grab the results of the prv,,ds command, but I'm only getting some of it using this particular code.
Thanks.
# At this point, it's presumably dumped it all
tn.read_until("ARGOS READY")
tn.read_until("/")
At a guess, this bit is sucking up the data and doing nothing with it. Think of it like a pair of pipes - you send stuff one way with write, and pull stuff back with read_*. If you've already sucked the stuff up, it won't still be waiting in the pipe when you do read_all later.
EDIT:
OK, you're seeing a different problem. Try this:
# read_until returns everything it consumed, so capture it instead of discarding it
lines = tn.read_until("ARGOS READY")
lines += tn.read_until("/")
tn.write("lo\n")
of.write(lines)
I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:
import sys
import urllib2_file
import urllib2
URL='http://aquate.us/upload.php'
d = [('uploaded', open(sys.argv[1]))]
req = urllib2.Request(URL, d)
u = urllib2.urlopen(req)
print u.read()
I've put this .py file in my My Documents directory and placed a shortcut to it in my Send To folder (the shortcut URL is ).
When I right click a file, choose Send To, and select Aquate (my python), it opens a command prompt for a split second and then closes it. Nothing gets uploaded.
I knew there was probably an error going on, so I typed the code into the Python command line, line by line.
When I ran the u=urllib2.urlopen(req) line, I didn't get an error;
(screenshot: http://www.aquate.us/u/55245858877937182052.jpg)
instead, the cursor simply started blinking on a new line beneath that line. I waited a couple of minutes to see if something would happen but it just stayed like that. To get it to stop, I had to press ctrl+break.
What's up with this script?
Thanks in advance!
[Edit]
Forgot to mention -- when I ran the script without the request data (the file) it ran like a charm. Is it a problem with urllib2_file?
[Edit 2]:
import MultipartPostHandler, urllib2, cookielib, sys
import win32clipboard as w
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),MultipartPostHandler.MultipartPostHandler)
params = {"uploaded" : open("c:/cfoot.js") }
a=opener.open("http://www.aquate.us/upload.php", params)
text = a.read()
w.OpenClipboard()
w.EmptyClipboard()
w.SetClipboardText(text)
w.CloseClipboard()
That code works like a charm if you run it through the command line.
If you're using Python 2.5 or newer, urllib2_file is both unnecessary and unsupported, so check which version you're using (and perhaps upgrade).
If you're using Python 2.3 or 2.4 (the only versions supported by urllib2_file), try running the sample code and see if you have the same problem. If so, there is likely something wrong either with your Python or urllib2_file installation.
EDIT:
Also, you don't seem to be using either of urllib2_file's two supported formats for POST data. Try using one of the following two lines instead:
d = ['uploaded', open(sys.argv[1])]
## --OR-- ##
d = {'uploaded': open(sys.argv[1])}
First, there's a third way to run Python programs.
From cmd.exe, type python myprogram.py. You get a nice log. You don't have to type stuff one line at a time.
Second, check the urllib2 documentation. You'll need to look at urllib, also.
A Request requires a URL and a urlencoded encoded buffer of data.
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
You need to encode your data.
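As a sketch of what that looks like for a plain form-style POST (the field name is illustrative, the file path is borrowed from your second edit, and note that this sends the file's contents as an ordinary form field, not a real multipart file upload):

import urllib
import urllib2

# Illustrative sketch: urlencode the POST body before handing it to Request.
payload = urllib.urlencode({'uploaded': open('c:/cfoot.js').read()})
req = urllib2.Request('http://aquate.us/upload.php', payload)
response = urllib2.urlopen(req)
print response.read()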
If you're still on Python 2.5, what worked for me was to download the code here:
http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html
and save it as MultipartPostHandler.py, then use:
import urllib2, MultipartPostHandler
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
or if you need cookies:
import urllib2, MultipartPostHandler, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})