I have a list with 300 million URLs. I need to invoke asynchronous REST API calls for these URLs; I don't require the responses.
I tried to implement this with Twisted. When the list grows beyond 1000 URLs I get an error. Please suggest how this could be achieved.
Please find my code below:
# start of my program
from twisted.web import client
from twisted.internet import reactor, defer

#list of urls to be invoked
urls = [
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808'
]
#list of urls

#the call back
def finish(results):
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
    reactor.stop()

waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
The first issue, given the source provided, is that 300 million URL strings will take a lot of RAM. Keep in mind that each string has overhead above and beyond its bytes, and combining the strings into a list will likely require re-allocations.
In addition, I think the subtle bug here is that you're accumulating the results into a list with waiting = [ ... ]. I suspect you really meant an iterator that feeds gatherResults().
To remedy both of these ills, write your URLs to a file such as "urls.txt" and try the following instead (also drop the bit with urls = [...]):
import sys

waiting = (client.getPage(url.strip()) for url in sys.stdin)
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
Simply run it with python script.py < urls.txt
The difference between [...] and (...) is quite large. [...] runs the ... part immediately, creating a giant list of the results; (...) creates a generator that will yield one result for each iteration of the ... part.
Note: I have not had a chance to test any of that (I don't use Twisted much), but from what you posted, these changes should help your RAM issue.
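As a quick illustration of that difference (plain Python, independent of Twisted), you can compare the memory footprint of a list comprehension against a generator expression:

import sys

squares_list = [n * n for n in range(10**6)]  # materializes a million results up front
squares_gen = (n * n for n in range(10**6))   # produces nothing until iterated over

print(sys.getsizeof(squares_list))  # several megabytes for the list object alone
print(sys.getsizeof(squares_gen))   # a small, constant-size generator object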
Related
I subscribe to a real-time stream which publishes a small JSON record at a slow rate (0.5 KB every 1-5 seconds). The publisher has provided a Python client that exposes these records. I write these records to a list in memory. The client is just a Python wrapper for doing a curl command on an HTTPS endpoint for a dataset. A dataset is defined by filters and fields. I can let the client run for a few days and stop it at midnight to process multiple days' worth of data as one batch.
Instead of the multi-day batches described above, I'd like to write every n records by treating the stream as a generator. The client code is below. I just added the append() line to create a list called 'records' (in memory) to play back later:
records = []
data_set = api.get_dataset(dataset_id='abc')
for record in data_set.request_realtime():
    records.append(record)
which, as expected, shows [*] in Jupyter Notebook and keeps running.
Then, I created a generator from my list in memory as follows to extract one record (n=1 for initial testing):
def Generator():
    count = 1
    while count < 2:
        for r in records:
            yield r.data
        count += 1
But my generator definition also gave me [*] and kept calculating, which I understand is because the list is still being written to in memory. But I thought my generator would be able to lock the state of my list and yield the first n records. It didn't. How can I code my generator in this case? And if a generator is not a good choice for this use case, please advise.
To give you the full picture: if my code were working, I'd have instantiated it, printed it, and received an object as expected, like this:
>>>my_generator = Generator()
>>>print(my_generator)
<generator object Gen at 0x0000000009910510>
Then, I'd have written it to a csv file like so:
with open('myfile.txt', 'w') as f:
    cf = csv.DictWriter(f, column_headers, extrasaction='ignore')
    cf.writeheader()
    cf.writerows(i.data for i in my_generator)
Note: I know there are many tools for this, e.g. Kafka, but I am in an initial PoC phase. Please use Python 2.x. Once I get my code working, I plan on stacking generators to set up my next n-record extraction so that I don't lose data in between. Any guidance on stacking would also be appreciated.
That's not how concurrency works. Unless some magic is being used that you didn't tell us about, while your first cell shows [*] you can't run more code. Putting the generator in another cell just adds it to a queue to run when the first cell finishes - since the first cell will never finish, the second will never even start running!
I suggest looking into some asynchronous networking library, like asyncio, twisted or trio. They allow you to make functions cooperative so while one of them is waiting for data, the other can run, instead of blocking. You'd probably have to rewrite the api.get_dataset code to be asynchronous as well.
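For illustration only, here is a minimal sketch of what the asyncio shape could look like, assuming the client were rewritten to expose a hypothetical async iterator (called request_realtime_async below); that method does not exist in the real client, and this requires Python 3 rather than the 2.x requested:

import asyncio
import csv

async def write_batches(data_set, n, path, column_headers):
    # Consume the (hypothetical) async stream and flush every n records to CSV.
    batch = []
    with open(path, 'w') as f:
        writer = csv.DictWriter(f, column_headers, extrasaction='ignore')
        writer.writeheader()
        async for record in data_set.request_realtime_async():  # hypothetical API
            batch.append(record.data)
            if len(batch) >= n:
                writer.writerows(batch)
                batch = []

# asyncio.run(write_batches(data_set, 100, 'myfile.txt', column_headers))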
I'm new to Tornado, so I wanted to know if the code below is a correct way to approach the problem or whether there is a better one. It works, but I'm not sure about its efficiency.
The code is based on the documentation here.
In the middle of my script, I need to run HTTP requests (10-50). Apparently, it is possible to do this in parallel this way:
@gen.coroutine
def parallel_fetch_many(urls):
    responses = yield [http_client.fetch(url) for url in urls]
    # responses is a list of HTTPResponses in the same order
How do I access responses after the coroutine is done? Can I just add return responses?
Also, as I only need to use an async process once in my code, I start the IOLoop this way:
# run_sync() doesn't take arguments, so we must wrap the
# call in a lambda.
IOLoop.current().run_sync(lambda: parallel_fetch_many(googleLinks))
Is it correct to do it this way? Or should I just start the IOLoop at the beginning of the script and stop it at the end, even though I only use an async process once?
Basically, my question is: Is the code below correct?
@gen.coroutine
def parallel_fetch_many(urls):
    responses = yield [http_client.fetch(url) for url in urls]
    return responses

googleLinks = [url1, url2, ..., urln]
responses = IOLoop.current().run_sync(lambda: parallel_fetch_many(googleLinks))
do_something(responses)
Yes, your code looks correct to me.
I'm making a SocketServer that will need to be able to handle a lot of commands. So, to keep my RequestHandler from becoming too long, it will call different functions depending on the command. My dilemma is how to make it send info back to the client.
Currently I'm making the functions "yield" everything they want to send back to the client. But I'm thinking that's probably not the Pythonic way.
# RequestHandler
func = __commands__.get(command, unkown_command)
for message in func():
    self.send(message)
# example_func
def example():
    yield 'ip: {}'.format(ip)
    yield 'count: {}'.format(count)
    . . .
    for ping in pinger(ip, count):
        yield ping
Is this an ugly use of yield? The only alternative I can think of is that, when the RequestHandler calls the function, it passes itself as an argument:
func(self)
and then, in the function:
def example(handler):
    . . .
    handler.send('ip: {}'.format(ip))
But this way doesn't feel much better.
def example():
    yield 'ip: {}'.format(ip)
    yield 'count: {}'.format(count)
What strikes me as strange in this solution is not the use of yield itself (which can be perfectly valid) but the fact that you're losing a lot of information by turning your data into strings prematurely.
In particular, for this kind of data, simply returning a dictionary and handling the sending in the caller seems more readable:
def example():
    return {'ip': ip, 'count': count}
This also helps you separate content and presentation, which might be useful if you want, for example, to return data encoded in XML but later switch to JSON.
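As a rough sketch of the caller side, reusing the names from the question (__commands__, unkown_command, self.send) and assuming the command functions return dictionaries, the encoding choice then lives in exactly one place:

import json

# In the RequestHandler:
func = __commands__.get(command, unkown_command)
payload = func()                 # e.g. {'ip': ip, 'count': count}
self.send(json.dumps(payload))   # switching to XML later only touches this line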
If you want to yield intermediate data, another possibility is using tuples: yield ('ip', ip). In this way you keep the original data and can start processing the values immediately outside the function.
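For example, a small sketch of the tuple variant, with the presentation decided by the caller (handler.send is the method from the question's alternative):

def example():
    yield ('ip', ip)
    yield ('count', count)

# The caller decides how each pair is formatted:
for key, value in example():
    handler.send('{}: {}'.format(key, value))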
I do the same as you with yield. The reason for this is simple:
With yield the main loop can easily handle the case that sending data to one socket will block. Each socket gets a buffer for outgoing data that you fill with the yield. The main loop tries to send as much of that as possible to the socket, and when it blocks it records how far it got in the buffer and waits for the socket to be ready for more. When the buffer is empty it runs next(func) to get the next chunk of data.
I don't see how you would do that with handler.send('ip: {}'.format(ip)). When that socket blocks you are stuck. You can't pause that send and handle other sockets easily.
Now for this to be useful there are some assumptions:
the data each yield sends is considerable and you don't want to generate all of it into one massive buffer ahead of time
generating the data for each yield takes time and you want to already send the finished parts
you want to use reply = yield data waiting for the peer to respond to the data in some way. Yes, you can make this a back and forth. next(func) becomes func.send(reply).
Any of these is a good reason to go the yield way or coroutines in general. The alternative seems to be to use one thread per socket.
Note: the func can also call other generators using yield from. Makes it easy to split a large problem into smaller handlers and to share common parts.
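A heavily simplified sketch of that buffering idea, assuming non-blocking sockets and a handlers dict that maps each socket to the generator its command function returned (both of those names are mine, not from the question):

import select

buffers = {}  # socket -> bytes still waiting to be sent

def pump(sock, gen):
    # Send what we can; refill the buffer from the generator when it runs dry.
    buf = buffers.get(sock, b'')
    if not buf:
        try:
            buf = next(gen).encode()  # ask the generator for the next chunk
        except StopIteration:
            return False              # nothing left: caller should close the socket
    sent = sock.send(buf)             # may send only part of the buffer
    buffers[sock] = buf[sent:]        # keep the remainder for the next round
    return True

while handlers:
    _, writable, _ = select.select([], list(handlers), [])
    for sock in writable:
        if not pump(sock, handlers[sock]):
            del handlers[sock]
            sock.close()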
I have a Python script that pulls from various internal network sources. With how our systems are set up, we initiate a urllib pull from a network location, and on certain parts of the network it gets hung up waiting forever for a response. I would like my script to check whether a pull has finished within, say, 5 minutes; if it hasn't, it should skip that function, attempt to pull from the next address, and record the address to a bad-directory log (so we can go check which systems get hung up - there are over 20,000 IP addresses we check, some with older scripts running on them that no longer work but still try to run when requested, and they never stop trying).
I'm familiar with having a script pause at a certain point:
import time
time.sleep(300)
What I'm thinking, from a pseudocode perspective (not proper Python, just illustrating the idea):
import time
import urllib2

url_dict = ['http://1', 'http://2', 'http://3', ...]
fail_log_path = 'C:/Temp/fail_log.txt'

for addresses in url_dict:
    clock_value = time.start()
    while clock_value <= 300:
        print str(clock_value)
        res = urllib2.retrieve(url)
    if res != []:
        pass
    else:
        fail_log = open(fail_log_path, 'a')
        fail_log.write("Failed to pull from site location: " + str(url) + "\n")
        faile_log.close
Update: a specific option for dealing with this URL timeout issue: timeout for urllib2.urlopen() in pre Python 2.6 versions
Found this answer which is more in line with the overall problem of my question:
kill a function after a certain time in windows
Your code as is doesn't seem to describe what you were saying. It seems you want the if/else check inside your while loop. On top of that, you would want to loop over the ip addresses and not over a time period as your code is currently written (otherwise you will keep requesting the same ip address every time). Instead of keeping track of time yourself, I would suggest reading up on urllib.request.urlopen - specifically the timeout parameter. Once set, that function call will throw a socket.timeout exception once the time limit is reached. Surround that with a try/except block catching that error and then handle it appropriately.
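A minimal sketch of that suggestion, written against urllib2 to match the question's code and assuming Python 2.6+ (where urlopen gained its timeout argument); the URL list and log path are the placeholders from the question:

import socket
import urllib2

urls = ['http://1', 'http://2', 'http://3']
fail_log_path = 'C:/Temp/fail_log.txt'

for url in urls:
    try:
        res = urllib2.urlopen(url, timeout=300)  # give up after 5 minutes
        res.read()
    except (socket.timeout, urllib2.URLError):
        # A connect timeout surfaces as URLError, a read timeout as socket.timeout.
        with open(fail_log_path, 'a') as fail_log:
            fail_log.write("Failed to pull from site location: " + url + "\n")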
I've made a script to get inventory data from the Steam API and I'm a bit unsatisfied with the speed. So I read a bit about multiprocessing in python and simply cannot wrap my head around it. The program works as such: it gets the SteamID from a list, gets the inventory and then appends the SteamID and the inventory in a dictionary with the ID as the key and inventory contents as the value.
I've also understood that there are some issues involved with using a counter when multiprocessing, which is a small problem as I'd like to be able to resume the program from the last fetched inventory rather than from the beginning again.
Anyway, what I'm asking for is really a concrete example of how to do multiprocessing when opening the URL that contains the inventory data so that the program can fetch more than one inventory at a time rather than just one.
On to the code:
with open("index_to_name.json", "r", encoding=("utf-8")) as fp:
index_to_name=json.load(fp)
with open("index_to_quality.json", "r", encoding=("utf-8")) as fp:
index_to_quality=json.load(fp)
with open("index_to_name_no_the.json", "r", encoding=("utf-8")) as fp:
index_to_name_no_the=json.load(fp)
with open("steamprofiler.json", "r", encoding=("utf-8")) as fp:
steamprofiler=json.load(fp)
with open("itemdb.json", "r", encoding=("utf-8")) as fp:
players=json.load(fp)
error=list()
playerinventories=dict()
c=127480
while c<len(steamprofiler):
inventory=dict()
items=list()
try:
url=urllib.request.urlopen("http://api.steampowered.com/IEconItems_440/GetPlayerItems/v0001/?key=DD5180808208B830FCA60D0BDFD27E27&steamid="+steamprofiler[c]+"&format=json")
inv=json.loads(url.read().decode("utf-8"))
url.close()
except (urllib.error.HTTPError, urllib.error.URLError, socket.error, UnicodeDecodeError) as e:
c+=1
print("HTTP-error, continuing")
error.append(c)
continue
try:
for r in inv["result"]["items"]:
inventory[r["id"]]=r["quality"], r["defindex"]
except KeyError:
c+=1
error.append(c)
continue
for key in inventory:
try:
if index_to_quality[str(inventory[key][0])]=="":
items.append(
index_to_quality[str(inventory[key][0])]
+""+
index_to_name[str(inventory[key][1])]
)
else:
items.append(
index_to_quality[str(inventory[key][0])]
+" "+
index_to_name_no_the[str(inventory[key][1])]
)
except KeyError:
print("keyerror, uppdate def_to_index")
c+=1
error.append(c)
continue
playerinventories[int(steamprofiler[c])]=items
c+=1
if c % 10==0:
print(c, "inventories downloaded")
I hope my problem is clear; otherwise, just say so. I would ideally avoid using third-party libraries, but if it's not possible, it's not possible. Thanks in advance.
So you're assuming the fetching of the URL might be the thing slowing your program down? You'd do well to check that assumption first, but if it is indeed the case, using the multiprocessing module is huge overkill: for I/O-bound bottlenecks, threading is quite a bit simpler and might even be a bit faster (it takes a lot more time to spawn another Python interpreter than to spawn a thread).
Looking at your code, you might get away with sticking most of the content of your while loop in a function with c as a parameter, and starting a thread from there using another function, something like:
def process_item(c):
    # The work goes here
    # Replace all those 'continue' statements with 'return'

for c in range(127480, len(steamprofiler)):
    thread = threading.Thread(name="inventory {0}".format(c), target=process_item, args=[c])
    thread.start()
A real problem might be that there's no limit to the amount of threads being spawned, which might break the program. Also the guys at Steam might not be amused at getting hammered by your script, and they might decide to un-friend you.
A better approach would be to fill a collections.deque object with your list of c's and then start a limited set of threads to do the work:
def process_item(c):
    # The work goes here
    # Replace all those 'continue' statements with 'return'

def process():
    while True:
        process_item(work.popleft())

work = collections.deque(range(127480, len(steamprofiler)))
threads = [threading.Thread(name="worker {0}".format(n), target=process)
           for n in range(6)]
for worker in threads:
    worker.start()
Note that I'm counting on work.popleft() to throw an IndexError when we're out of work, which will kill the thread. That's a bit sneaky, so consider using a try...except instead.
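For instance, the try/except version of the worker loop might look like this:

def process():
    while True:
        try:
            c = work.popleft()
        except IndexError:  # the deque is empty: no more work left
            return
        process_item(c)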
Two more things:
Consider using the excellent Requests library instead of urllib (which, API-wise, is by far the worst module in the entire Python standard library that I've worked with).
For Requests, there's an add-on called grequests which allows you to do fully asynchronous HTTP requests. That would have made for even simpler code; see the sketch below.
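A rough sketch of what that could look like with grequests (a third-party library, so it conflicts with your wish to avoid them; YOUR_KEY is a placeholder and steamprofiler is the list from your code):

import grequests

base = ("http://api.steampowered.com/IEconItems_440/GetPlayerItems/v0001/"
        "?key=YOUR_KEY&steamid={}&format=json")
requests_iter = (grequests.get(base.format(sid)) for sid in steamprofiler[127480:])
for response in grequests.imap(requests_iter, size=6):  # at most 6 requests in flight
    print(response.status_code)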
I hope this helps, but please keep in mind this is all untested code.
The outermost while loop can be distributed over a few processes (or tasks).
When you break the loop into tasks, note that you are sharing the playerinventories and error objects between processes; you will need multiprocessing.Manager to handle that sharing.
I recommend you start modifying your code along those lines.
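Purely as an illustration of that sharing, a minimal sketch using multiprocessing.Manager with a process pool might look like this (the worker body is elided; steamprofiler is the list loaded in your code):

import multiprocessing

def fetch_inventory(c, playerinventories, error):
    # Fetch and parse the inventory for steamprofiler[c] here, then either
    # store the parsed items or record the index as a failure:
    playerinventories[c] = []  # placeholder for the parsed items
    # error.append(c)          # on failure

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    playerinventories = manager.dict()  # proxies every worker process can update
    error = manager.list()
    pool = multiprocessing.Pool(processes=4)
    for c in range(127480, len(steamprofiler)):
        pool.apply_async(fetch_inventory, (c, playerinventories, error))
    pool.close()
    pool.join()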