I have a list of 300 million URLs, and I need to invoke an async REST API call for each of them. I don't require the responses.
I tried to implement this with Twisted; when the list grows beyond about 1,000 URLs, I get an error. Please suggest how this could be achieved.
Here is my code:
# start of my program
from twisted.web import client
from twisted.internet import reactor, defer
#list of urls to be invoked
urls = [
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808'
]
#list of urls
#the call back
def finish(results):
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
    reactor.stop()
waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
The first issue, given the source provided, is that 300 million URL strings will take a lot of RAM. Keep in mind each string has overhead above and beyond the bytes and the combination of the strings into a list will likely require re-allocations.
In addition, I think the subtle bug here is that you're trying to accumulate the results into a list with waiting = [ ... ]. I suspect you really meant that you wanted an iterator that fed gatherResults().
To remedy both these ills, write your URLs to a file named "urls.txt" and try the following instead (also drop the urls = [...] bit):
import sys

waiting = (client.getPage(url.strip()) for url in sys.stdin)
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
Simply run it using python script.py < urls.txt
The difference between [...] and (...) is quite large: [...] runs the ... part immediately, creating a giant list of the results, while (...) creates a generator that yields one result per iteration.
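To make the distinction concrete, here is a tiny illustrative sketch (the variable names are made up for this example):

# [...] builds the whole list in memory immediately
squares_list = [n * n for n in xrange(5)]    # [0, 1, 4, 9, 16], all at once

# (...) builds a generator; each value is produced only when asked for
squares_gen = (n * n for n in xrange(5))
for sq in squares_gen:
    print sq                                 # computed one at a time, never all in memory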
Note: I have not had a chance to test any of that (I don't use Twisted much), but from what you posted, these changes should help with your RAM issue.
I have been trying to work with the standard GPS (gps.py) module in Python 2.6. This is supposed to act as a client and read GPS data from gpsd running on Ubuntu.
According to the documentation from the GPSD webpage on client design (GPSD Client Howto), I should be able to use the following code (slightly modified from the example) to get the latest GPS readings (lat/long is what I am mainly interested in):
from gps import *
session = gps() # assuming gpsd running with default options on port 2947
session.stream(WATCH_ENABLE|WATCH_NEWSTYLE)
report = session.next()
print report
If I repeatedly call next(), it gives me buffered values from the bottom of the queue (from when the session was started), not the LATEST GPS reading. Is there a way to get more recent values using this library? In a way, to seek the stream to the latest values?
Has anyone got a code example using this library to poll the GPS and get the value I am looking for?
Here is what I am trying to do:
Start the session
Wait for the user to call the gps_poll() method in my code
Inside this method, read the latest TPV (Time Position Velocity) report and return lat/long
Go back to waiting for the user to call gps_poll()
What you need to do is regularly poll session.next() - the issue here is that you're dealing with a serial interface: you get results in the order they were received. It's up to you to maintain a current_value that holds the latest retrieved value.
If you don't poll the session object, eventually your UART FIFO will fill up and you won't get any new values anyway.
Consider using a thread for this. Don't wait for the user to call gps_poll(); you should be polling continuously, and when the user wants a new value they call get_current_value(), which returns current_value.
Off the top of my head it could be something as simple as this:
import threading
import time
from gps import *

class GpsPoller(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.session = gps(mode=WATCH_ENABLE)
        self.current_value = None

    def get_current_value(self):
        return self.current_value

    def run(self):
        try:
            while True:
                self.current_value = self.session.next()
                time.sleep(0.2)  # tune this, you might not get values that quickly
        except StopIteration:
            pass

if __name__ == '__main__':
    gpsp = GpsPoller()
    gpsp.start()
    # gpsp now polls every .2 seconds for new data, storing it in self.current_value
    while 1:
        # In the main thread, every 5 seconds print the current value
        time.sleep(5)
        print gpsp.get_current_value()
The above answers are very inefficient and overly complex for anyone using modern versions of gpsd who needs data only at specific times, instead of streaming.
Most GPSes send their position information at least once per second. Presumably since many GPS-based applications desire real-time updates, the vast majority of gpsd client examples I've seen use the above method of watching a stream from gpsd and receiving realtime updates (more or less as often as the gps sends them).
However, if (as in the OP's case) you don't need streaming information but just need the last-reported position whenever it's requested (i.e. via user interaction or some other event), there's a much more efficient and simpler method: let gpsd cache the latest position information, and query it when needed.
The gpsd JSON protocol has a ?POLL; request, which returns the most recent GPS information that gpsd has seen. Instead of having to iterate over the backlog of gps messages, and continually read new messages to avoid full buffers, you can send a ?WATCH={"enable":true} message at the start of the gpsd session, and then query the latest position information whenever you need it with ?POLL;. The response is a single JSON object containing the most recent information that gpsd has seen from the GPS.
If you're using Python 3, the easiest way I've found is to use the gpsd-py3 package available on PyPI. To connect to gpsd, get the latest position information, and print the current position:
import gpsd
gpsd.connect()
packet = gpsd.get_current()
print(packet.position())
You can repeat the gpsd.get_current() call whenever you want new position information, and behind the scenes the gpsd package will execute the ?POLL; call to gpsd and return an object representing the response.
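For example, a hypothetical handler that fetches a fresh position whenever some event fires might look like this (on_user_request is a made-up name):

def on_user_request():
    # each call issues a fresh ?POLL; to gpsd behind the scenes
    packet = gpsd.get_current()
    lat, lon = packet.position()   # same position() call as above
    print(lat, lon)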
Doing this with the built-in gps module isn't terribly straightforward, but there are a number of other Python clients available, and it's also rather trivial to do with anything that can perform socket communication, including this example using telnet:
$ telnet localhost 2947
Trying ::1...
Connected to localhost.
Escape character is '^]'.
{"class":"VERSION","release":"3.16","rev":"3.16","proto_major":3,"proto_minor":11}
?WATCH={"enable":true}
{"class":"DEVICES","devices":[{"class":"DEVICE","path":"/dev/pts/10","driver":"SiRF","activated":"2018-03-02T21:14:52.687Z","flags":1,"native":1,"bps":4800,"parity":"N","stopbits":1,"cycle":1.00}]}
{"class":"WATCH","enable":true,"json":false,"nmea":false,"raw":0,"scaled":false,"timing":false,"split24":false,"pps":false}
?POLL;
{"class":"POLL","time":"2018-03-02T21:14:54.873Z","active":1,"tpv":[{"class":"TPV","device":"/dev/pts/10","mode":3,"time":"2005-06-09T14:34:53.280Z","ept":0.005,"lat":46.498332203,"lon":7.567403907,"alt":1343.165,"epx":24.829,"epy":25.326,"epv":78.615,"track":10.3788,"speed":0.091,"climb":-0.085,"eps":50.65,"epc":157.23}],"gst":[{"class":"GST","device":"/dev/pts/10","time":"1970-01-01T00:00:00.000Z","rms":0.000,"major":0.000,"minor":0.000,"orient":0.000,"lat":0.000,"lon":0.000,"alt":0.000}],"sky":[{"class":"SKY","device":"/dev/pts/10","time":"2005-06-09T14:34:53.280Z","xdop":1.66,"ydop":1.69,"vdop":3.42,"tdop":3.05,"hdop":2.40,"gdop":5.15,"pdop":4.16,"satellites":[{"PRN":23,"el":6,"az":84,"ss":0,"used":false},{"PRN":28,"el":7,"az":160,"ss":0,"used":false},{"PRN":8,"el":66,"az":189,"ss":45,"used":true},{"PRN":29,"el":13,"az":273,"ss":0,"used":false},{"PRN":10,"el":51,"az":304,"ss":29,"used":true},{"PRN":4,"el":15,"az":199,"ss":36,"used":true},{"PRN":2,"el":34,"az":241,"ss":41,"used":true},{"PRN":27,"el":71,"az":76,"ss":42,"used":true}]}]}
?POLL;
{"class":"POLL","time":"2018-03-02T21:14:58.856Z","active":1,"tpv":[{"class":"TPV","device":"/dev/pts/10","mode":3,"time":"2005-06-09T14:34:53.280Z","ept":0.005,"lat":46.498332203,"lon":7.567403907,"alt":1343.165,"epx":24.829,"epy":25.326,"epv":78.615,"track":10.3788,"speed":0.091,"climb":-0.085,"eps":50.65,"epc":157.23}],"gst":[{"class":"GST","device":"/dev/pts/10","time":"1970-01-01T00:00:00.000Z","rms":0.000,"major":0.000,"minor":0.000,"orient":0.000,"lat":0.000,"lon":0.000,"alt":0.000}],"sky":[{"class":"SKY","device":"/dev/pts/10","time":"2005-06-09T14:34:53.280Z","xdop":1.66,"ydop":1.69,"vdop":3.42,"tdop":3.05,"hdop":2.40,"gdop":5.15,"pdop":4.16,"satellites":[{"PRN":23,"el":6,"az":84,"ss":0,"used":false},{"PRN":28,"el":7,"az":160,"ss":0,"used":false},{"PRN":8,"el":66,"az":189,"ss":45,"used":true},{"PRN":29,"el":13,"az":273,"ss":0,"used":false},{"PRN":10,"el":51,"az":304,"ss":29,"used":true},{"PRN":4,"el":15,"az":199,"ss":36,"used":true},{"PRN":2,"el":34,"az":241,"ss":41,"used":true},{"PRN":27,"el":71,"az":76,"ss":42,"used":true}]}]}
Adding my two cents.
For whatever reason, my Raspberry Pi would continue to execute a thread and I'd have to hard-reset the Pi.
So I've combined synthesizerpatel's answer with one I found on Dan Mandel's blog.
My gps_poller class looks like this:
import os
from gps import *
import time
import threading

class GpsPoller(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.session = gps(mode=WATCH_ENABLE)
        self.current_value = None
        self.running = True

    def get_current_value(self):
        return self.current_value

    def run(self):
        try:
            while self.running:
                self.current_value = self.session.next()
        except StopIteration:
            pass
And the code in use looks like this:
from gps_poll import *

if __name__ == '__main__':
    gpsp = GpsPoller()
    try:
        gpsp.start()
        while True:
            os.system('clear')
            report = gpsp.get_current_value()
            # print report
            try:
                if report.keys()[0] == 'epx':
                    print report['lat']
                    print report['lon']
                time.sleep(.5)
            except (AttributeError, KeyError):
                pass
            time.sleep(0.5)
    except (KeyboardInterrupt, SystemExit):
        print "\nKilling Thread.."
        gpsp.running = False
        gpsp.join()
        print "Done.\nExiting."
I know it's an old thread, but just for everyone's understanding: you can also use the pyembedded Python library for this.
pip install pyembedded
from pyembedded.gps_module.gps import GPS
import time

gps = GPS(port='COM3', baud_rate=9600)
while True:
    print(gps.get_lat_long())
    time.sleep(1)
https://pypi.org/project/pyembedded/
I currently have a Python application where newline-terminated ASCII strings are being transmitted to me via a TCP/IP socket. I have a high data rate of these strings and I need to parse them as quickly as possible. Currently, the strings are being transmitted as CSV and if the data rate is high enough, my Python application starts to lag behind the input data rate (probably not all that surprising).
The strings look something like this:
chan,2007-07-13T23:24:40.143,0,0188878425-079,0,0,True,S-4001,UNSIGNED_INT,name1,module1,...
I have a corresponding object that will parse these strings and store all of the data into an object. Currently the object looks something like this:
class ChanVal(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])

    def parseFromCsv(self, csvString):
        lst = csvString.split(',')
        self.eventTime = lst[1]
        self.eventTimeExact = long(lst[2])
        self.other_clock = lst[3]
        ...
To read the data in from the socket, I'm using a basic socket.socket(socket.AF_INET, socket.SOCK_STREAM) (my app is the server socket), and then using the select.poll() object from the select module to constantly poll the socket for new input via its poll(...) method.
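For context, a minimal sketch (simplified, with a made-up port and buffer size) of that setup looks something like this:

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('', 9000))                   # hypothetical port for illustration
server.listen(1)
conn, addr = server.accept()

poller = select.poll()
poller.register(conn, select.POLLIN)

buf = ''
done = False
while not done:
    # poll() takes a timeout in milliseconds and returns (fd, event) pairs
    for fd, event in poller.poll(1000):
        data = conn.recv(4096)
        if not data:                      # peer closed the connection
            done = True
            break
        buf += data
        # hand off each complete newline-terminated record
        while '\n' in buf:
            line, buf = buf.split('\n', 1)
            print line                    # one CSV record, ready to parse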
I have some control over the process sending the data (meaning I can get the sender to change the format), but it would be really convenient if we could speed up the ASCII processing enough to not have to use fixed-width or binary formats for the data.
So far, here are the things I've tried that haven't really made much of a difference:
Using the string "split" method and then indexing the list of results directly (see above), but "split" seems to be really slow.
Using the "reader" object in the "csv" module to parse the strings
Changing the strings being sent to a string format that I can use to directly instantiate an object via "eval" (e.g. sending something like "ChanVal(eventTime='2007-07-13T23:24:40.143',eventTimeExact=0,...)")
I'm trying to avoid going to a fixed-width or binary format, though I realize those would probably ultimately be much faster.
Ultimately, I'm open to suggestions on better ways to poll the socket, better ways to format/parse the data (though hopefully we can stick with ASCII) or anything else you can think of.
Thanks!
You can't make Python faster. But you can make your Python application faster.
Principle 1: Do Less.
You can't do less input parsing overall, but you can do less input parsing in the process that's also reading the socket and doing everything else with the data.
Generally, do this.
Break your application into a pipeline of discrete steps.
Read the socket, break into fields, create a named tuple, write the tuple to a pipe with something like pickle.
Read a pipe (with pickle) to construct the named tuple, do some processing, write to another pipe.
Read a pipe, do some processing, write to a file or something.
Each of these three processes, connected by OS pipes, runs concurrently. That means the first process is reading the socket and making tuples while the second process is consuming tuples and doing calculations, and the third is doing calculations and writing a file.
This kind of pipeline maximizes what your CPU can do. Without too many painful tricks.
Reading and writing to pipes is trivial, since Linux assures you that sys.stdin and sys.stdout will be pipes when the shell creates the pipeline.
Before doing anything else, break your program into pipeline stages.
proc1.py
import sys
import cPickle
from collections import namedtuple

ChanVal = namedtuple('ChanVal', ['eventTime', 'eventTimeExact', 'other_clock', ... ])

# 'socket' here stands for your line-oriented socket object
for line in socket:
    c = ChanVal(*line.strip().split(','))
    cPickle.dump(c, sys.stdout)
proc2.py
import sys
import cPickle
from collections import namedtuple

ChanVal = namedtuple('ChanVal', ['eventTime', 'eventTimeExact', 'other_clock', ... ])

while True:
    item = cPickle.load(sys.stdin)  # raises EOFError when the upstream pipe closes
    # processing
    cPickle.dump(item, sys.stdout)
This idea of processing namedtuples through a pipeline is very scalable.
python proc1.py | python proc2.py
You need to profile your code to find out where the time is being spent.
That doesn't necessarily mean using Python's profiler.
For example, you can just try parsing the same CSV string 1,000,000 times with different methods. Choose the fastest method, divide by 1,000,000, and now you know how much CPU time it takes to parse one string.
Try to break the program into parts and work out what resources are really required by each part.
The parts that need the most CPU per input line are your bottlenecks.
On my computer, the program below outputs this
ChanVal0 took 0.210402965546 seconds
ChanVal1 took 0.350302934647 seconds
ChanVal2 took 0.558166980743 seconds
ChanVal3 took 0.691503047943 seconds
So you can see that about half of the time is taken up by parseFromCsv, but also that quite a lot of time is spent extracting the values and storing them in the class.
If the class isn't used right away, it might be faster to store the raw data and use properties to parse the csvString on demand (see the sketch after the benchmark below).
from time import time
import re

class ChanVal0(object):
    def __init__(self, csvString=None, **kwargs):
        self.csvString = csvString
        for key in kwargs:
            setattr(self, key, kwargs[key])

class ChanVal1(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])

    def parseFromCsv(self, csvString):
        self.lst = csvString.split(',')

class ChanVal2(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])

    def parseFromCsv(self, csvString):
        lst = csvString.split(',')
        self.eventTime = lst[1]
        self.eventTimeExact = long(lst[2])
        self.other_clock = lst[3]

class ChanVal3(object):
    splitter = re.compile("[^,]*,(?P<eventTime>[^,]*),(?P<eventTimeExact>[^,]*),(?P<other_clock>[^,]*)")

    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        self.__dict__.update(kwargs)

    def parseFromCsv(self, csvString):
        self.__dict__.update(self.splitter.match(csvString).groupdict())

s = "chan,2007-07-13T23:24:40.143,0,0188878425-079,0,0,True,S-4001,UNSIGNED_INT,name1,module1"
RUNS = 100000

for cls in ChanVal0, ChanVal1, ChanVal2, ChanVal3:
    start_time = time()
    for i in xrange(RUNS):
        cls(s)
    print "%s took %s seconds" % (cls.__name__, time() - start_time)