I have an issue parsing a continuous stream of multiple XML documents sent by a third party over a socket. A sample of the XML stream sent over the socket is:
<?xml version="1.0"?><event><user id="1098"/><viewpage>109958</viewpage></event>
<?xml version="1.0"?><event><user id="1482"/><actions><edit>102865</edit><commit>1592356</commit></actions></event>
etc.
Here's the code I'm using:
import socket
import xml.etree.cElementTree as etree
from StringIO import StringIO

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host = "IP.IP.IP.IP"
port = 8080  # must be an int, not a string
addr = (host, port)
s.connect(addr)

def iparse(packet):
    for _, element in etree.iterparse(packet):
        print("%s, %s" % (element.tag, element.text))
        element.clear()
        # if a complete <event> node has been received, publish the node

data = "<feeds>"
while 1:
    chunk = s.recv(1024)
    # turn the per-document XML declarations into comments
    data += (chunk.replace("<?", "<!--")).replace("?>", "-->")
    iparse(StringIO(data))
Things work just fine; however, the for loop in iparse iterates through the entire document each time. Is it possible for iparse to build and iterate through one well-formed <event> node as it appears on the stream instead? Note that there is no way I can set the chunk size so that each recv returns a well-formed packet. I could keep a buffer, build the packet there, and only hand it to iparse once it's well-formed, but wouldn't that introduce unwanted latency? Is there a better way to handle this?
EDIT:
Each event is distinct but contains arbitrary nodes under the root <event>. iparse is expected to publish the latest event to an arbitrary number of subscribers within a real-time analytics graphing system.
You could have a look at the feed parsing in lxml.etree. However, you're still going to run into problems as your document is continually growing.
Are the XML blobs separated by newline characters? If so, I suggest you buffer until you hit a newline and then send each line to an XML parser, à la Twisted's LineReceiver.
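Here is a minimal sketch of that line-buffering approach, assuming each document arrives on its own line as in your sample (the host, port, and publish step are placeholders):

import socket
from lxml import etree

def events(sock):
    # Yield one parsed <event> element per newline-terminated document.
    buf = b""
    while True:
        chunk = sock.recv(1024)
        if not chunk:
            break  # connection closed
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line.strip():
                yield etree.fromstring(line)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("IP.IP.IP.IP", 8080))
for event in events(s):
    # publish the completed node to your subscribers here
    print(etree.tostring(event))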
Actually, if it were me, I'd probably write this application in Twisted; gluing together network services like this is a common use case for it.
I have a client and a server, where the server needs to send a number of text files to the client.
The send file function receives the socket and the path of the file to send:
import os

CHUNKSIZE = 1_000_000

def send_file(sock, filepath):
    with open(filepath, 'rb') as f:
        sock.sendall(f'{os.path.getsize(filepath)}'.encode() + b'\r\n')
        # Send the file in chunks so large files can be handled.
        while True:
            data = f.read(CHUNKSIZE)
            if not data:
                break
            sock.send(data)
And the receive file function receives the client socket and the path where to save the incoming file:
CHUNKSIZE = 1_000_000

def receive_file(sock, filepath):
    with sock.makefile('rb') as file_socket:
        length = int(file_socket.readline())
        # Read the data in chunks so it can handle large files.
        with open(filepath, 'wb') as f:
            while length:
                chunk = min(length, CHUNKSIZE)
                data = file_socket.read(chunk)
                if not data:
                    break
                f.write(data)
                length -= len(data)
            if length != 0:
                print('Invalid download.')
            else:
                print('Done.')
It works by sending the file size as the first line, then sending the file contents in chunks.
Both are invoked in loops in the client and the server, so that files are sent and saved one by one.
It works fine if I put a breakpoint and invoke these functions slowly. But if I let the program run uninterrupted, it fails when reading the size of the second file:
File "/home/stark/Work/test/networking.py", line 29, in receive_file
length = int(file_socket.readline())
ValueError: invalid literal for int() with base 10: b'00,1851,-34,-58,782,-11.91,13.87,-99.55,1730,-16,-32,545,-12.12,19.70,-99.55,1564,-8,-10,177,-12.53,24.90,-99.55,1564,-8,-5,88,-12.53,25.99,-99.55,1564,-8,-3,43,-12.53,26.54,-99.55,0,60,0\r\n'
Clearly a lot more data is being received by that length = int(file_socket.readline()) line.
My questions: why is that? Shouldn't that line read only the size given that it's always sent with a trailing \n?
How can I fix this so that multiple files can be sent in a row?
Thanks!
It seems you're reusing the same connection, and the problem is that file_socket is buffered: it has actually recv'd more data from the socket than your read loop consumed.
In other words, the file-like wrapper pulls more off the socket than the current file's length, so the next time you readline() you get the remainder of the previous file (up to whatever newline it happens to contain) or part of the next length header.
This also means that in your initial run you actually skipped part of a file; the effect is that the next line read is not the int you expected, hence the observed failure.
You can say:
with sock.makefile('rb', buffering=0) as file_socket:
instead, to force the file-like access to be unbuffered. Or actually handle the receiving, buffering, and parsing of the incoming bytes (understanding where one file ends and the next one begins) on your own, instead of relying on the file-like wrapper and readline().
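A minimal sketch of the do-it-yourself variant, assuming the same "<size>\r\n<bytes>" framing as in your code (recv_exact is a hypothetical helper name):

def recv_exact(sock, n):
    # Keep calling recv until exactly n bytes have arrived.
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(min(n - len(buf), CHUNKSIZE))
        if not chunk:
            raise ConnectionError('socket closed mid-file')
        buf += chunk
    return buf

def receive_file(sock, filepath):
    header = b''
    while not header.endswith(b'\r\n'):
        header += recv_exact(sock, 1)  # read the length line byte by byte
    length = int(header)
    with open(filepath, 'wb') as f:
        while length:
            data = recv_exact(sock, min(length, CHUNKSIZE))
            f.write(data)
            length -= len(data)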
You have to understand that socket communication is based on TCP/IP regardless of whether it's the same machine (you use the loopback interface in that case) or different machines. So you've got IP addresses between which the connection is established. It also involves going through your network adapter, which takes relatively long compared to accessing, e.g., RAM, and the adapter itself manages when particular data frames are sent (the lower ISO/OSI layers). TCP does require an ACK for delivered data, but a standard PC is not an industrial, real-time Ethernet setup.
So, in your code you've got a while True loop without any sleep, and you don't check what sock.send returns. Even if only part of a particular buffer gets sent, you ignore that and try to send the next one. At first glance it looks like something was cached and the receiver got what was flushed once the connection was re-established.
So, the first thing you should do is check whether sock.send actually returned the number of bytes you asked it to send; if not, the remainder has to be re-sent (or use sock.sendall, which does this for you). Another thing I strongly recommend in such cases is designing a small custom protocol (usually called the application layer in the OSI/ISO stack). For example, you might have 4 frame types: START, FILESIZE, DATA, END; assign each a unique ID and start every frame with that identifier. START would be empty, FILESIZE would contain a single uint16, DATA would contain {FILE NUMBER, LINE NUMBER, LINE_LENGTH, LINE}, and END would be empty. Then, once you have an entire frame on the client, you can safely assemble the information you received.
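A hedged sketch of such framing, assuming a 1-byte frame type plus a 4-byte big-endian payload length as the frame header (the names and layout are illustrative, not a standard):

import struct

FRAME_START, FRAME_FILESIZE, FRAME_DATA, FRAME_END = range(4)

def send_frame(sock, frame_type, payload=b''):
    # sendall retries until every byte is handed to the kernel,
    # unlike send, which may write only part of the buffer.
    sock.sendall(struct.pack('>BI', frame_type, len(payload)) + payload)

def recv_frame(sock):
    # recv_exact is the helper sketched in the previous answer.
    frame_type, length = struct.unpack('>BI', recv_exact(sock, 5))
    return frame_type, recv_exact(sock, length)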
I'm a newbie in Python; I've only written a couple of scripts so far.
Now I need to listen for and process XML text received from a UDP socket.
At the moment I have the first part working, but I don't know how to proceed.
import socket
import lxml.etree
port = 7059
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("", port))
print "waiting on port:", port
while 1:
    data, addr = s.recvfrom(1024)
    print data
I'm receiving the data ok, and it's printed on the screen:
<electricity id='4437190066CD'><timestamp>1532728995</timestamp><signal rssi='-78' lqi='94'/><battery level='10%'/><chan id='0'><curr units='w'>609.00</curr><day units='wh'>34.64</day></chan><chan id='1'><curr units='w'>42.00</curr><day units='wh'>2.38</day></chan><chan id='2'><curr units='w'>538.00</curr><day units='wh'>30.43</day></chan></electricity>
But I need to get or parse the following values:
<chan id='0'><curr units='w'>609.00</curr>
<chan id='1'><curr units='w'>42.00</curr>
<chan id='2'><curr units='w'>538.00</curr>
and assign them to something like a variable with sub-objects:
power.ch0 = 609.00
power.ch1 = 42.00
power.ch2 = 538.00
With that variable, I will then make a request to my power monitoring system's API to send these values.
I usually use bash for scripting and I'm pretty happy with it, but I think it's not rich enough this time, and Python seems to be my solution.
Thanks in advance for your help!!
There are several ways to do this. First, I do not see where you are using the xml.etree module. Take a look at the docs real fast: https://docs.python.org/3/library/xml.etree.elementtree.html
Also, you can achieve this with BeautifulSoup as well: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Lastly, you can achieve this with the .replace() and .strip() functions as well
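For instance, here is a minimal sketch using lxml.etree (which you already import), assuming each datagram carries one complete <electricity> document like the one shown above:

from lxml import etree

class Power(object):
    pass

data, addr = s.recvfrom(4096)
root = etree.fromstring(data)
power = Power()
for chan in root.findall('chan'):
    # e.g. chan id='0' with <curr units='w'>609.00</curr> becomes power.ch0
    setattr(power, 'ch' + chan.get('id'), float(chan.findtext('curr')))
print("ch0=%s ch1=%s ch2=%s" % (power.ch0, power.ch1, power.ch2))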
Is there a way to load a particular packet from a pcap file using Scapy?
I know that I can load a specific number of packets using the sniff function and its count attribute, e.g.:
sniff(offline='file.pcap', prn=action, count=31)
However, I need to get the 30th packet without loading the previous packets.
In other words, I am not satisfied with an approach like this:
packets = [pkt for pkt in sniff(offline=path, prn=action, count=31)]
print(packets[30])
Trying to load, say, the millionth packet this way takes far too long.
Each packet header states how long it is. Once the parser has read that header, it can calculate the position of the next one. So as far as I know, you cannot open a pcap file and instantly locate packet 30; you'll need to parse the headers of the first 29.
But you don't have to keep all packets in memory either, as long as you process them while receiving.
from scapy.all import sniff

for i, pkt in enumerate(sniff(offline=path, prn=action)):
    if i == 30:
        print(pkt)
        break
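If you want to skip the payloads without having Scapy parse every packet, you can walk the record headers yourself. A hedged sketch, assuming a classic little-endian pcap file (magic 0xa1b2c3d4; pcapng files are laid out differently):

import struct

def read_nth_packet(path, n):
    # Return the raw bytes of packet n (0-based) from a classic pcap file.
    with open(path, 'rb') as f:
        f.seek(24)  # skip the 24-byte global header
        for _ in range(n):
            header = f.read(16)  # record header: ts_sec, ts_usec, incl_len, orig_len
            incl_len = struct.unpack('<I', header[8:12])[0]
            f.seek(incl_len, 1)  # jump over this packet's payload
        header = f.read(16)
        incl_len = struct.unpack('<I', header[8:12])[0]
        return f.read(incl_len)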
My project is a directional antenna which is mounted on a self-stabilizing base. The language I wish to use is python, but changing this to a more suited language, is a possibility, if needed.
Problem 1:
How would you go about taking in serial data in real time[1] and then parsing it in Python?
Problem 2:
How can I then send the output of the program to servos, which are mounted on the base? (feedback system).
[1](Fastest possible time for data transfer, processing and then output)
You can use the pyserial module to read serial port data with Python. See: http://pyserial.sourceforge.net/shortintro.html
Here's a short usage example from the docs:
>>> import serial
>>> ser = serial.Serial('/dev/ttyS1', 19200, timeout=1)
>>> x = ser.read() # read one byte
>>> s = ser.read(10) # read up to ten bytes (timeout)
>>> line = ser.readline() # read a '\n' terminated line
>>> ser.close()
Next, you'll need to parse the GPS data. Most devices support "NMEA 0183" format, and here's another SO question with information about parsing that with Python: Parsing GPS receiver output via regex in Python
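As a rough illustration, here's a hedged sketch combining the two, assuming the device emits standard $GPRMC sentences (the port name and baud rate are placeholders):

import serial

def nmea_to_degrees(value, hemisphere):
    # NMEA encodes latitude as ddmm.mmmm and longitude as dddmm.mmmm.
    dot = value.index('.')
    degrees = float(value[:dot - 2])
    minutes = float(value[dot - 2:])
    result = degrees + minutes / 60.0
    return -result if hemisphere in ('S', 'W') else result

ser = serial.Serial('/dev/ttyS1', 19200, timeout=1)
while True:
    line = ser.readline().decode('ascii', 'replace').strip()
    if line.startswith('$GPRMC'):
        fields = line.split(',')
        if fields[2] == 'A':  # 'A' means the receiver has a valid fix
            lat = nmea_to_degrees(fields[3], fields[4])
            lon = nmea_to_degrees(fields[5], fields[6])
            print(lat, lon)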
Finally, outputting data for servo control will depend entirely on whatever hardware you are using for the servo interface.
Is there a way to download a huge, still-growing file over HTTP using the partial-download feature?
It seems that this code downloads the file from scratch every time it is executed:
import urllib
urllib.urlretrieve ("http://www.example.com/huge-growing-file", "huge-growing-file")
I'd like:
To fetch just the newly written data.
To download from scratch only if the source file becomes smaller (for example, because it has been rotated).
It is possible to do a partial download using the Range header; the following will request a selected range of bytes:
import urllib2

req = urllib2.Request('http://www.python.org/')
req.headers['Range'] = 'bytes=%s-%s' % (start, end)
f = urllib2.urlopen(req)
For example:
>>> req = urllib2.Request('http://www.python.org/')
>>> req.headers['Range'] = 'bytes=%s-%s' % (100, 150)
>>> f = urllib2.urlopen(req)
>>> f.read()
'l1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.'
Using this header you can resume partial downloads. In your case, all you have to do is keep track of the already-downloaded size and request a new range.
Keep in mind that the server needs to accept this header for this to work.
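A hedged sketch of tailing the growing file this way, assuming the server honors Range requests and returns 416 when the requested range starts past the current end of the file (detecting rotation and restarting from scratch is left out):

import time
import urllib2

url = 'http://www.example.com/huge-growing-file'
offset = 0
with open('huge-growing-file', 'wb') as out:
    while True:
        req = urllib2.Request(url)
        req.headers['Range'] = 'bytes=%s-' % offset
        try:
            chunk = urllib2.urlopen(req).read()
        except urllib2.HTTPError as e:
            if e.code == 416:  # nothing new yet, or the file shrank (rotation)
                time.sleep(5)
                continue
            raise
        out.write(chunk)
        offset += len(chunk)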
This is quite easy to do using TCP sockets and raw HTTP. The relevant request header is "Range".
An example request might look like:
import socket

mysock = socket.create_connection(("www.example.com", 80))
mysock.sendall(
    "GET /huge-growing-file HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Range: bytes=XXXX-\r\n"
    "Connection: close\r\n\r\n")
Where XXXX represents the number of bytes you've already retrieved. Then you can read the response headers and any content from the server. If the server returns a header like:
Content-Length: 0
You know you've got the entire file.
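For completeness, a minimal sketch of reading the raw response, assuming "Connection: close" so the body simply ends when the server closes the socket:

response = ''
while True:
    chunk = mysock.recv(4096)
    if not chunk:
        break  # server closed the connection: body is complete
    response += chunk
headers, _, body = response.partition('\r\n\r\n')
status_line = headers.split('\r\n', 1)[0]  # e.g. 'HTTP/1.1 206 Partial Content'
print(status_line)
with open('huge-growing-file', 'ab') as f:
    f.write(body)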
If you want to be a particularly nice HTTP client, you can look into "Connection: keep-alive". Perhaps there is a Python library that does everything I have described (perhaps even urllib2 does!), but I'm not familiar with one.
If I understand your question correctly, the file is not changing during download but is updated regularly. If that is the case, rsync is the answer.
If the file is updated continually, including during download, you'll need to modify rsync or use a BitTorrent-style program. These split files into separate chunks and download or update the chunks independently. When you reach the end of the file on the first iteration, repeat to get the appended chunk, and continue as necessary. Less efficiently, one could just run rsync repeatedly.