Python efficiency for iterating through large lists/arrays - python

I will be attaching a client to listen in on an OPC server comprising thousands of nodes that read mechanical data from various sensors. The code for the client will be written in some version of Python 3.
So essentially, the task comes down to:
Connect to the server endpoint
Iterate through all the nodes and pick up their values from the server
Store the read values in some format (not decided yet)
I've written a basic sample code, just to add some reference:
import sys
import time
import datetime
import asyncio
import json

from opcua import Client

endpoint = "opc.tcp://some-server-endpoint:portid/foobar"

# In actual DEV code, there will be about 100_000 nodes.
# Will make reading the nodes into a separate loop function
# to avoid hard coding all the node IDs.
nodes = [
    "i=1001",
    "i=1002",
    "i=1003"
]

# Connect to Endpoint (Probably will wrap this into a
# function in DEV as well but need to look at how the
# library dependencies handle the connection)
try:
    server = Client(endpoint)
    server.connect()
    print("Connection Success.")
except Exception as err:
    print("Connection Error:", err)
    sys.exit(1)

# Function to return node values, nodeID and timestamp
async def read_node(arg) -> str:
    node_conn = server.get_node(arg)
    node_value = node_conn.get_value()
    output = {
        "nodeID": str(arg),
        "value": str(node_value),
        "timestamp": str(datetime.datetime.now().strftime("%Y%m%d%H%M%S"))
    }
    global output_log
    output_log = json.dumps(output)
    print(output_log)

# Returns output to JSON file (For the sake of practice,
# just made it to dump all outputs into a single JSON
# file for now)
def json_convert(data) -> str:
    with open("opc_data.json", "a") as outfile:
        json.dump(data, outfile, indent=2)

# Loops through nodes list, outputs to json file
async def main() -> None:
    while True:
        for i in range(len(nodes)):
            async_read = await read_node(nodes[i])
            json_convert(output_log)
        time.sleep(1)

if __name__ == "__main__":
    asyncio.run(main())
I am not looking for a code review (although all suggestions are welcome!!).
My primary question is how to maximize efficiency when looping through thousands of items in a list or array, given that the data needs to be read consistently and constantly at 1-second intervals or less.
In the above code I used async so that each batch of data can be stored independently, and together, in various storage locations, but there seem to be other options available as well, such as multiprocessing or threading.
Each option seems to have its pros and cons, and I am not sure how many cores I will end up being able to use for this job, which could make something like multiprocessing less efficient.
Currently, the only thing I can think of is maybe using a library like numpy to take advantage of something like np.array.
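One direction I'm considering, for reference, is switching to the async fork of the library (asyncua) and issuing the reads concurrently instead of one blocking call at a time. A rough sketch of that idea, assuming asyncua's Client/Node API (the endpoint and node IDs are the same placeholders as above):

import asyncio
import datetime
import json

from asyncua import Client  # pip install asyncua (async fork of python-opcua)

ENDPOINT = "opc.tcp://some-server-endpoint:portid/foobar"
NODE_IDS = ["i=1001", "i=1002", "i=1003"]  # would be ~100_000 IDs in DEV

async def main() -> None:
    async with Client(url=ENDPOINT) as client:
        node_objs = [client.get_node(nid) for nid in NODE_IDS]
        while True:
            started = asyncio.get_running_loop().time()
            # Fire all reads concurrently; a single batched call such as
            # client.read_values(node_objs), if the installed version offers
            # it, would cut server round trips even further.
            values = await asyncio.gather(*(n.read_value() for n in node_objs))
            timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
            records = [
                {"nodeID": nid, "value": str(val), "timestamp": timestamp}
                for nid, val in zip(NODE_IDS, values)
            ]
            with open("opc_data.json", "a") as outfile:
                json.dump(records, outfile)
                outfile.write("\n")
            # Sleep only for what is left of the 1 s interval, and never
            # block the event loop with time.sleep().
            elapsed = asyncio.get_running_loop().time() - started
            await asyncio.sleep(max(0.0, 1.0 - elapsed))

if __name__ == "__main__":
    asyncio.run(main())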

Related

Python3.8 asyncio Streams and read

This is probably a stupid question, but I'm missing something that I think is probably obvious:
I'm working with the Python 3.8 asyncio streams:
import asyncio
async def client_handler():
    reader, writer = await asyncio.open_connection('127.0.0.1', 9128)
    server_name = writer.get_extra_info('peername')[0]
    print(f"Connecting to: { server_name }")
    data = b''
    close_client = False
    try:
        while not close_client:
            print(data)
            data = await reader.read(1024)
            print(data)
            if data != b'':
                print(data.decode('utf-8'))
                data = b''
                writer.close()
    except:
        pass
    finally:
        writer.close()
        await writer.wait_closed()

asyncio.run(client_handler())
I guess I expected that it would try to read 1024 bytes, and if there was nothing there it would just return None or an empty byte string or something, but instead it just sits there until data is received.
Am I misunderstanding what read is supposed to do? Is there another method I could use to peek into a buffer, or poll to see whether any data is actually incoming?
For instance, let's say I'm writing an example chat program whose server and client need to be able to send and receive data dynamically at the same time: how do I implement that with the asyncio streams? Should I just build my own asyncio.Protocol subclass instead?
You can use tools like asyncio.gather(), asyncio.wait(return_when=FIRST_COMPLETED), and asyncio.wait_for() to multiplex between different operations, such as reads from different streams, or reads and writes. Without additional details regarding your use case, it's hard to give concrete advice on how to proceed.
Building an asyncio.Protocol or using feed_eof() directly is almost certainly the wrong approach, unless you are writing very specialized software and know exactly what you are doing.
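As a rough illustration of the multiplexing idea (not a drop-in for your code: the host/port are the ones from your snippet, and the queue feeding outgoing messages is an assumption), you can keep one pending read task and one pending "something to send" task and wait on whichever finishes first:

import asyncio

async def chat_client(outgoing: asyncio.Queue) -> None:
    reader, writer = await asyncio.open_connection('127.0.0.1', 9128)
    read_task = asyncio.create_task(reader.read(1024))
    send_task = asyncio.create_task(outgoing.get())
    try:
        while True:
            done, _pending = await asyncio.wait(
                {read_task, send_task},
                return_when=asyncio.FIRST_COMPLETED,
            )
            if read_task in done:
                data = read_task.result()
                if not data:  # b'' means the peer closed the connection
                    break
                print(data.decode('utf-8'))
                read_task = asyncio.create_task(reader.read(1024))
            if send_task in done:
                writer.write(send_task.result().encode('utf-8'))
                await writer.drain()
                send_task = asyncio.create_task(outgoing.get())
    finally:
        read_task.cancel()
        send_task.cancel()
        writer.close()
        await writer.wait_closed()

If all you want is a read that gives up after a short timeout instead of blocking forever, asyncio.wait_for(reader.read(1024), timeout) does that, raising asyncio.TimeoutError when nothing arrives in time.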

Using a variable in the later part of the program (python)

I'm trying to combine two separate scripts into one program. I need to pass one string from the first part to the second.
First:
import boto3

if __name__ == "__main__":
    bucket = 'BUCKET-NAME'
    collectionId = 'COLLECTION-ID'
    fileName = 'input.jpg'
    threshold = 70
    maxFaces = 1

    client = boto3.client('rekognition')
    response = client.search_faces_by_image(CollectionId=collectionId,
                                            Image={'S3Object': {'Bucket': bucket, 'Name': fileName}},
                                            FaceMatchThreshold=threshold,
                                            MaxFaces=maxFaces)
    faceMatches = response['FaceMatches']
    for match in faceMatches:
        print(match['Face']['FaceId'])
Second:
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('faces')
response = table.scan(
    FilterExpression=Attr('faceid').eq('FaceId')
)
items = response['Items']
print(items)
I need to take the ID printed by print(match['Face']['FaceId']) in the first script and use it as the FaceId in the second.
I tried to define a variable, put the value into it, and read it later, but I could not get it to work.
Typically, you'd write your first block of code as a library/module with a function that does some unit of work and returns the result. The second block of code would then import the first and call the function.
# lib.py
def SomeFunction(inputs):
    output = doSomething(inputs)
    return output

# main.py
import lib

data = ...
result = lib.SomeFunction(data)
moreWork(result)
If you want two separate programs that run independently and share data, you want inter-process communication (IPC). You can get processes to share information with each other via a file/FIFO in the filesystem, a network socket, shared memory, or STDIO (and probably more). However, IPC is definitely more work than synchronous library calls.
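Applied to the code in the question, the library approach might look roughly like this (a sketch only: the module and function names are made up, while the Rekognition and DynamoDB parameters are taken from the question):

# rekognition_lib.py (hypothetical module name)
import boto3

def find_face_id(bucket, collection_id, file_name, threshold=70, max_faces=1):
    client = boto3.client('rekognition')
    response = client.search_faces_by_image(
        CollectionId=collection_id,
        Image={'S3Object': {'Bucket': bucket, 'Name': file_name}},
        FaceMatchThreshold=threshold,
        MaxFaces=max_faces)
    matches = response['FaceMatches']
    # Return the first matched FaceId, or None if nothing matched.
    return matches[0]['Face']['FaceId'] if matches else None

# main.py
import boto3
from boto3.dynamodb.conditions import Attr

import rekognition_lib

face_id = rekognition_lib.find_face_id('BUCKET-NAME', 'COLLECTION-ID', 'input.jpg')
if face_id is not None:
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('faces')
    response = table.scan(FilterExpression=Attr('faceid').eq(face_id))
    print(response['Items'])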

300 million REST API calls to be made in Python

I have a list with 300 million URLs. I need to invoke async REST API calls with these URLs. I don't require the responses.
I tried to implement this with Twisted. When the list grows to more than 1000 URLs I am getting an error. Please suggest how this could be achieved.
Please find my code
# start of my program
from twisted.web import client
from twisted.internet import reactor, defer

# list of urls to be invoked
urls = [
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808',
    'http://test.com/apiname/?s=85465&ts=1370591808'
]
# list of urls

# the call back
def finish(results):
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
    reactor.stop()

waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
The first issue, given the source provided, is that 300 million URL strings will take a lot of RAM. Keep in mind that each string has overhead above and beyond its bytes, and combining the strings into a list will likely require re-allocations.
In addition, I think the subtle bug here is that you're trying to accumulate the results into a list with waiting = [ ... ]. I suspect you really meant an iterator that feeds gatherResults().
To remedy both of these ills, write your URLs into "urls.txt" and try the following instead (also drop the bit with urls = [...]):
import sys

waiting = (client.getPage(url.strip()) for url in sys.stdin)
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
Simply run it using python script.py < urls.txt
The difference between [...] and (...) is quite large. [...] runs the ... part immediately, creating a giant list of the results; (...) creates a generator that will yield one result for each iteration.
Note: I have not had a chance to test any of this (I don't use Twisted much), but from what you posted these changes should help your RAM issue.
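To make the list-versus-generator point concrete, here is a toy example with no Twisted involved:

# A list comprehension materialises every element up front...
squares_list = [n * n for n in range(5)]   # the whole list [0, 1, 4, 9, 16] sits in memory

# ...while a generator expression produces each element lazily, on demand.
squares_gen = (n * n for n in range(5))
for value in squares_gen:
    print(value)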

Python GPS Module: Reading latest GPS Data

I have been trying to work with the standard GPS (gps.py) module in Python 2.6. It is supposed to act as a client and read GPS data from gpsd running on Ubuntu.
According to the documentation from the GPSD web page on client design (GPSD Client HOWTO), I should be able to use the following code (slightly modified from the example) to get the latest GPS readings (lat/long is what I am mainly interested in):
from gps import *
session = gps() # assuming gpsd running with default options on port 2947
session.stream(WATCH_ENABLE|WATCH_NEWSTYLE)
report = session.next()
print report
If I repeatedly call next(), it gives me buffered values from the bottom of the queue (from when the session was started), and not the LATEST GPS reading. Is there a way to get more recent values using this library? In a way, to seek the stream to the latest values?
Has anyone got a code example using this library to poll the GPS and get the value I am looking for?
Here is what I am trying to do:
start the session
Wait for user to call the gps_poll() method in my code
Inside this method read the latest TPV (Time Position Velocity) report and return lat long
Go back to waiting for user to call gps_poll()
What you need to do is regularly poll 'session.next()' - the issue here is that you're dealing with a serial interface, so you get results in the order they were received. It's up to you to maintain a 'current_value' that holds the latest retrieved value.
If you don't poll the session object, eventually your UART FIFO will fill up and you won't get any new values anyway.
Consider using a thread for this, don't wait for the user to call gps_poll(), you should be polling and when the user wants a new value they use 'get_current_value()' which returns current_value.
Off the top of my head it could be something as simple as this:
import threading
import time
from gps import *

class GpsPoller(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.session = gps(mode=WATCH_ENABLE)
        self.current_value = None

    def get_current_value(self):
        return self.current_value

    def run(self):
        try:
            while True:
                self.current_value = self.session.next()
                time.sleep(0.2)  # tune this, you might not get values that quickly
        except StopIteration:
            pass

if __name__ == '__main__':
    gpsp = GpsPoller()
    gpsp.start()
    # gpsp now polls every .2 seconds for new data, storing it in self.current_value
    while 1:
        # In the main thread, every 5 seconds print the current value
        time.sleep(5)
        print gpsp.get_current_value()
The above answers are very inefficient and overly complex for anyone using modern versions of gpsd and needing data at only specific times, instead of streaming.
Most GPSes send their position information at least once per second. Presumably since many GPS-based applications desire real-time updates, the vast majority of gpsd client examples I've seen use the above method of watching a stream from gpsd and receiving realtime updates (more or less as often as the gps sends them).
However, if (as in the OP's case) you don't need streaming information but just need the last-reported position whenever it's requested (i.e. via user interaction or some other event), there's a much more efficient and simpler method: let gpsd cache the latest position information, and query it when needed.
The gpsd JSON protocol has a ?POLL; request, which returns the most recent GPS information that gpsd has seen. Instead of having to iterate over the backlog of gps messages, and continually read new messages to avoid full buffers, you can send a ?WATCH={"enable":true} message at the start of the gpsd session, and then query the latest position information whenever you need it with ?POLL;. The response is a single JSON object containing the most recent information that gpsd has seen from the GPS.
If you're using Python 3, the easiest way I've found is to use the gpsd-py3 package available on PyPI. To connect to gpsd, get the latest position information, and print the current position:
import gpsd
gpsd.connect()
packet = gpsd.get_current()
print(packet.position())
You can repeat the gpsd.get_current() call whenever you want new position information, and behind the scenes the gpsd package will execute the ?POLL; call to gpsd and return an object representing the response.
Doing this with the built-in gps module isn't terribly straightforward, but there are a number of other Python clients available, and it's also rather trivial to do with anything that can perform socket communication, including this example using telnet:
$ telnet localhost 2947
Trying ::1...
Connected to localhost.
Escape character is '^]'.
{"class":"VERSION","release":"3.16","rev":"3.16","proto_major":3,"proto_minor":11}
?WATCH={"enable":true}
{"class":"DEVICES","devices":[{"class":"DEVICE","path":"/dev/pts/10","driver":"SiRF","activated":"2018-03-02T21:14:52.687Z","flags":1,"native":1,"bps":4800,"parity":"N","stopbits":1,"cycle":1.00}]}
{"class":"WATCH","enable":true,"json":false,"nmea":false,"raw":0,"scaled":false,"timing":false,"split24":false,"pps":false}
?POLL;
{"class":"POLL","time":"2018-03-02T21:14:54.873Z","active":1,"tpv":[{"class":"TPV","device":"/dev/pts/10","mode":3,"time":"2005-06-09T14:34:53.280Z","ept":0.005,"lat":46.498332203,"lon":7.567403907,"alt":1343.165,"epx":24.829,"epy":25.326,"epv":78.615,"track":10.3788,"speed":0.091,"climb":-0.085,"eps":50.65,"epc":157.23}],"gst":[{"class":"GST","device":"/dev/pts/10","time":"1970-01-01T00:00:00.000Z","rms":0.000,"major":0.000,"minor":0.000,"orient":0.000,"lat":0.000,"lon":0.000,"alt":0.000}],"sky":[{"class":"SKY","device":"/dev/pts/10","time":"2005-06-09T14:34:53.280Z","xdop":1.66,"ydop":1.69,"vdop":3.42,"tdop":3.05,"hdop":2.40,"gdop":5.15,"pdop":4.16,"satellites":[{"PRN":23,"el":6,"az":84,"ss":0,"used":false},{"PRN":28,"el":7,"az":160,"ss":0,"used":false},{"PRN":8,"el":66,"az":189,"ss":45,"used":true},{"PRN":29,"el":13,"az":273,"ss":0,"used":false},{"PRN":10,"el":51,"az":304,"ss":29,"used":true},{"PRN":4,"el":15,"az":199,"ss":36,"used":true},{"PRN":2,"el":34,"az":241,"ss":41,"used":true},{"PRN":27,"el":71,"az":76,"ss":42,"used":true}]}]}
?POLL;
{"class":"POLL","time":"2018-03-02T21:14:58.856Z","active":1,"tpv":[{"class":"TPV","device":"/dev/pts/10","mode":3,"time":"2005-06-09T14:34:53.280Z","ept":0.005,"lat":46.498332203,"lon":7.567403907,"alt":1343.165,"epx":24.829,"epy":25.326,"epv":78.615,"track":10.3788,"speed":0.091,"climb":-0.085,"eps":50.65,"epc":157.23}],"gst":[{"class":"GST","device":"/dev/pts/10","time":"1970-01-01T00:00:00.000Z","rms":0.000,"major":0.000,"minor":0.000,"orient":0.000,"lat":0.000,"lon":0.000,"alt":0.000}],"sky":[{"class":"SKY","device":"/dev/pts/10","time":"2005-06-09T14:34:53.280Z","xdop":1.66,"ydop":1.69,"vdop":3.42,"tdop":3.05,"hdop":2.40,"gdop":5.15,"pdop":4.16,"satellites":[{"PRN":23,"el":6,"az":84,"ss":0,"used":false},{"PRN":28,"el":7,"az":160,"ss":0,"used":false},{"PRN":8,"el":66,"az":189,"ss":45,"used":true},{"PRN":29,"el":13,"az":273,"ss":0,"used":false},{"PRN":10,"el":51,"az":304,"ss":29,"used":true},{"PRN":4,"el":15,"az":199,"ss":36,"used":true},{"PRN":2,"el":34,"az":241,"ss":41,"used":true},{"PRN":27,"el":71,"az":76,"ss":42,"used":true}]}]}
Adding my two cents.
For whatever reason my Raspberry Pi would continue to execute a thread and I'd have to hard-reset the Pi.
So I've combined sysnthesizerpatel's answer with one I found on Dan Mandel's blog here.
My gps_poller class looks like this:
import os
from gps import *
from time import *
import time
import threading

class GpsPoller(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.session = gps(mode=WATCH_ENABLE)
        self.current_value = None
        self.running = True

    def get_current_value(self):
        return self.current_value

    def run(self):
        try:
            while self.running:
                self.current_value = self.session.next()
        except StopIteration:
            pass
And the code in use looks like this:
from gps_poll import *

if __name__ == '__main__':
    gpsp = GpsPoller()
    try:
        gpsp.start()
        while True:
            os.system('clear')
            report = gpsp.get_current_value()
            # print report
            try:
                if report.keys()[0] == 'epx':
                    print report['lat']
                    print report['lon']
                    time.sleep(.5)
            except (AttributeError, KeyError):
                pass
            time.sleep(0.5)
    except (KeyboardInterrupt, SystemExit):
        print "\nKilling Thread.."
        gpsp.running = False
        gpsp.join()
        print "Done.\nExiting."
You can also find the code here: Here and Here
I know it's an old thread, but just for everyone's understanding: you can also use the pyembedded Python library for this.
pip install pyembedded
from pyembedded.gps_module.gps import GPS
import time

gps = GPS(port='COM3', baud_rate=9600)
while True:
    print(gps.get_lat_long())
    time.sleep(1)
https://pypi.org/project/pyembedded/

Getting fast translation of string data transmitted via a socket into objects in Python

I currently have a Python application where newline-terminated ASCII strings are being transmitted to me via a TCP/IP socket. I have a high data rate of these strings and I need to parse them as quickly as possible. Currently, the strings are being transmitted as CSV and if the data rate is high enough, my Python application starts to lag behind the input data rate (probably not all that surprising).
The strings look something like this:
chan,2007-07-13T23:24:40.143,0,0188878425-079,0,0,True,S-4001,UNSIGNED_INT,name1,module1,...
I have a corresponding object that will parse these strings and store all of the data into an object. Currently the object looks something like this:
class ChanVal(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])

    def parseFromCsv(self, csvString):
        lst = csvString.split(',')
        self.eventTime = lst[1]
        self.eventTimeExact = long(lst[2])
        self.other_clock = lst[3]
        ...
To read the data in from the socket, I'm using a basic "socket.socket(socket.AF_INET,socket.SOCK_STREAM)" (my app is the server socket) and then I'm using the "select.poll()" object from the "select" module to constantly poll the socket for new input using its "poll(...)" method.
I have some control over the process sending the data (meaning I can get the sender to change the format), but it would be really convenient if we could speed up the ASCII processing enough to not have to use fixed-width or binary formats for the data.
So up until now, here are the things I've tried and haven't really made much of a difference:
Using the string "split" method and then indexing the list of results directly (see above), but "split" seems to be really slow.
Using the "reader" object in the "csv" module to parse the strings
Changing the strings being sent to a string format that I can use to directly instantiate an object via "eval" (e.g. sending something like "ChanVal(eventTime='2007-07-13T23:24:40.143',eventTimeExact=0,...)")
I'm trying to avoid going to a fixed-width or binary format, though I realize those would probably ultimately be much faster.
Ultimately, I'm open to suggestions on better ways to poll the socket, better ways to format/parse the data (though hopefully we can stick with ASCII) or anything else you can think of.
Thanks!
You can't make Python faster. But you can make your Python application faster.
Principle 1: Do Less.
You can't do less input parsing overall, but you can do less input parsing in the process that's also reading the socket and doing everything else with the data.
Generally, do this.
Break your application into a pipeline of discrete steps.
Read the socket, break into fields, create a named tuple, write the tuple to a pipe with something like pickle.
Read a pipe (with pickle) to construct the named tuple, do some processing, write to another pipe.
Read a pipe, do some processing, write to a file or something.
Each of these three processes, connected with OS pipes, runs concurrently. That means the first process is reading the socket and making tuples while the second process is consuming tuples and doing calculations, and the third process is doing calculations and writing a file.
This kind of pipeline maximizes what your CPU can do. Without too many painful tricks.
Reading and writing to pipes is trivial, since linux assures you that sys.stdin and sys.stdout will be pipes when the shell creates the pipeline.
Before doing anything else, break your program into pipeline stages.
proc1.py
import sys
import cPickle
from collections import namedtuple

ChanVal = namedtuple('ChanVal', ['eventTime', 'eventTimeExact', 'other_clock', ... ])

for line in socket:  # pseudocode: iterate over lines arriving on the socket
    c = ChanVal(*line.strip().split(','))
    cPickle.dump(c, sys.stdout)
proc2.py
import sys
import cPickle
from collections import namedtuple

ChanVal = namedtuple('ChanVal', ['eventTime', 'eventTimeExact', 'other_clock', ... ])

while True:
    item = cPickle.load(sys.stdin)
    # processing
    cPickle.dump(item, sys.stdout)
This idea of processing namedtuples through a pipeline is very scalable.
python proc1.py | python proc2.py
You need to profile your code to find out where the time is being spent.
That doesn't necessarily mean using python's profiler
For example, you can just try parsing the same CSV string 1,000,000 times with different methods. Choose the fastest method and divide by 1,000,000; now you know how much CPU time it takes to parse one string.
Try to break the program into parts and work out what resources are really required by each part.
The parts that need the most CPU per input line are your bottlenecks.
On my computer, the program below outputs this
ChanVal0 took 0.210402965546 seconds
ChanVal1 took 0.350302934647 seconds
ChanVal2 took 0.558166980743 seconds
ChanVal3 took 0.691503047943 seconds
So you can see that about half of the time is taken up by parseFromCsv, but also that quite a lot of time is spent extracting the values and storing them in the class.
If the class isn't used right away, it might be faster to store the raw data and use properties to parse the csvString on demand; a sketch of that idea follows the listing below.
from time import time
import re

class ChanVal0(object):
    def __init__(self, csvString=None, **kwargs):
        self.csvString = csvString
        for key in kwargs:
            setattr(self, key, kwargs[key])

class ChanVal1(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])

    def parseFromCsv(self, csvString):
        self.lst = csvString.split(',')

class ChanVal2(object):
    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        for key in kwargs:
            setattr(self, key, kwargs[key])

    def parseFromCsv(self, csvString):
        lst = csvString.split(',')
        self.eventTime = lst[1]
        self.eventTimeExact = long(lst[2])
        self.other_clock = lst[3]

class ChanVal3(object):
    splitter = re.compile("[^,]*,(?P<eventTime>[^,]*),(?P<eventTimeExact>[^,]*),(?P<other_clock>[^,]*)")

    def __init__(self, csvString=None, **kwargs):
        if csvString is not None:
            self.parseFromCsv(csvString)
        self.__dict__.update(kwargs)

    def parseFromCsv(self, csvString):
        self.__dict__.update(self.splitter.match(csvString).groupdict())

s = "chan,2007-07-13T23:24:40.143,0,0188878425-079,0,0,True,S-4001,UNSIGNED_INT,name1,module1"
RUNS = 100000

for cls in ChanVal0, ChanVal1, ChanVal2, ChanVal3:
    start_time = time()
    for i in xrange(RUNS):
        cls(s)
    print "%s took %s seconds" % (cls.__name__, time() - start_time)
