Incremental parse of appended data to external XML file in Python

I've got a log file on an external computer on my LAN. The log is an XML file. The file is not accessible over HTTP and is updated every second.
Currently I copy the log file to my computer and then run the parser, but I want to parse the file directly from the external host.
How can I do this in Python? Is it possible to parse the whole file once, and later parse only the new content appended to the end in future versions?

You can use paramiko and xml.sax's default parser, xml.sax.expatreader, which implements xml.sax.xmlreader.IncrementalParser.
I ran the following script on a local virtual machine to produce the XML.
#!/bin/bash
echo "<root>" > data.xml
I=0
while sleep 2; do
    echo "<entry><a>value $I</a><b foo='bar' /></entry>" >> data.xml;
    I=$((I + 1));
done
Here's the incremental consumer.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import time
import xml.sax
from contextlib import closing

import paramiko.client


class StreamHandler(xml.sax.handler.ContentHandler):

    lastEntry = None
    lastName = None

    def startElement(self, name, attrs):
        self.lastName = name
        if name == 'entry':
            self.lastEntry = {}
        elif name != 'root':
            self.lastEntry[name] = {'attrs': attrs, 'content': ''}

    def endElement(self, name):
        if name == 'entry':
            print({
                'a': self.lastEntry['a']['content'],
                'b': self.lastEntry['b']['attrs'].getValue('foo')
            })
            self.lastEntry = None

    def characters(self, content):
        if self.lastEntry:
            self.lastEntry[self.lastName]['content'] += content


if __name__ == '__main__':
    # use the default ``xml.sax.expatreader``
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())

    client = paramiko.client.SSHClient()
    # or use ``client.load_system_host_keys()`` if appropriate
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect('192.168.122.40', username='root', password='pass')
    with closing(client) as ssh:
        with closing(ssh.open_sftp()) as sftp:
            with closing(sftp.open('/root/data.xml')) as f:
                while True:
                    buffer = f.read(4096)
                    if buffer:
                        parser.feed(buffer)
                    else:
                        time.sleep(2)

I am assuming that another process, to which you don't have access, is maintaining the XML as an object that is updated every so often and then dumped to the file.
If you don't have access to the source of the program dumping the XML, you will need fancy diffing between the two XML versions to get an incremental update to send over the network, and I think you would have to parse the new XML each time to be able to compute that diff.
So maybe you could have a Python process watching the file, parsing each new version, diffing it (for instance using solutions from this article), and then sending that difference over the network using a tool like xmlrpc. If you want to save bandwidth, it'll probably help. Although I think I would rather send the raw diff directly over the network, then patch and parse the file on the local machine.
However, if only some of your XML values are changing (no node deletion or insertion), there may be a faster solution. Or, if the only operation on your XML file is to append new trees, then you should be able to parse only these new trees and send them over (diff first, then parse on the server, send to the client, merge on the client).
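To illustrate that append-only case, here is a minimal sketch of my own (assuming Python 3.4+ for xml.etree.ElementTree.XMLPullParser and the data.xml produced by the bash script above; for the remote case, the paramiko SFTP file object from the first answer can be substituted for the local open()). It remembers the last read offset and feeds only the newly appended bytes to an incremental parser:
import time
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=('end',))
offset = 0

with open('data.xml', 'rb') as f:
    while True:
        f.seek(offset)
        chunk = f.read()
        if chunk:
            offset += len(chunk)
            parser.feed(chunk)
            # read_events() only reports elements that are complete,
            # so partially written entries are held back until finished
            for event, elem in parser.read_events():
                if elem.tag == 'entry':
                    print(elem.findtext('a'))
        else:
            time.sleep(2)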

Related

Loading and dumping multiple yaml files with ruamel.yaml (python)

Using python 2 (atm) and ruamel.yaml 0.13.14 (RedHat EPEL)
I'm currently writing some code to load YAML definitions, but they are split across multiple files. The user-editable part contains e.g.:
users:
  xxxx1:
    timestamp: '2018-10-22 11:38:28.541810'
    << : *userdefaults
  xxxx2:
    << : *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
the defaults are stored in another file, which is not editable:
userdefaults: &userdefaults
  # Default values for user settings
  fileCountQuota: 1000
  diskSizeQuota: "300g"
I can process these together by loading both, concatenating the strings, and then running them through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)), which correctly resolves everything. (When not using RoundTripLoader I get errors that the references cannot be resolved, which is normal.)
Now, I want to do some updates via Python code (e.g. update the timestamp), and for that I need to write back just the user part. And that's where things get hairy. So far I haven't found a way to write just that YAML document, not both.
First of all, unless there are multiple documents in your defaults file, you
don't have to use load_all, as you don't concatenate two documents into a
multiple-document stream. If you had, by using a format string with a document-end
marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}"),
your aliases would not carry over from one document to another, as per the
YAML specification:

It is an error for an alias node to use an anchor that does not
previously occur in the document.

The anchor has to be in the document, not just in the stream (which can consist of multiple
documents).
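A quick sketch demonstrating that rule (the inlined strings are simplified stand-ins for the two files in the question; the legacy 0.13-era API used in this answer is assumed):
from ruamel import yaml

defaults_data = "userdefaults: &userdefaults\n  fileCountQuota: 1000\n"
user_data = "users:\n  xxxx1:\n    <<: *userdefaults\n"

# one document: the alias resolves against the anchor defined earlier in it
merged = yaml.load("{}\n{}".format(defaults_data, user_data),
                   Loader=yaml.RoundTripLoader)
print(merged['users']['xxxx1']['fileCountQuota'])  # -> 1000

# two documents: the anchor lives in the first document only, so composing
# the second document fails
try:
    list(yaml.load_all("{}\n---\n{}".format(defaults_data, user_data),
                       Loader=yaml.RoundTripLoader))
except yaml.composer.ComposerError as err:
    print(err)  # found undefined alias 'userdefaults'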
I tried some hocus pocus, pre-populating the already represented dictionary
of anchored nodes:
import sys
import datetime
from ruamel import yaml


def load():
    with open('defaults.yaml') as fp:
        defaults_data = fp.read()
    with open('user.yaml') as fp:
        user_data = fp.read()
    merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                            Loader=yaml.RoundTripLoader)
    return merged_data


class MyRTDGen(object):
    class MyRTD(yaml.RoundTripDumper):
        def __init__(self, *args, **kw):
            pps = kw.pop('pre_populate', None)
            yaml.RoundTripDumper.__init__(self, *args, **kw)
            if pps is not None:
                for pp in pps:
                    try:
                        anchor = pp.yaml_anchor()
                    except AttributeError:
                        anchor = None
                    node = yaml.nodes.MappingNode(
                        u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
                    self.represented_objects[id(pp)] = node

    def __init__(self, pre_populate=None):
        assert isinstance(pre_populate, list)
        self._pre_populate = pre_populate

    def __call__(self, *args, **kw):
        kw1 = kw.copy()
        kw1['pre_populate'] = self._pre_populate
        myrtd = self.MyRTD(*args, **kw1)
        return myrtd


def update(md, file_name):
    ud = md.pop('userdefaults')
    MyRTD = MyRTDGen([ud])
    yaml.dump(md, sys.stdout, Dumper=MyRTD)
    with open(file_name, 'w') as fp:
        yaml.dump(md, fp, Dumper=MyRTD)


md = load()
md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
update(md, 'user.yaml')
Since the PyYAML-based API requires a class instead of an object, you need to
use a class generator that actually adds the data elements to pre-populate on
the fly from within yaml.dump().
But this doesn't work, as a node only gets written out with an anchor once it is
determined that the anchor is used (i.e. there is a second reference). So actually the
first merge key gets written out as an anchor. And although I am quite familiar
with the code base, I could not get this to work properly in a reasonable amount of time.
So instead, I would just rely on the fact that there is only one key matching
the first key of user.yaml at the root level of the dump of the combined updated
file, and strip anything before that.
import sys
import datetime
from ruamel import yaml

with open('defaults.yaml') as fp:
    defaults_data = fp.read()
with open('user.yaml') as fp:
    user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                        Loader=yaml.RoundTripLoader)

# find the key
for line in user_data.splitlines():
    line = line.split('# ')[0].rstrip()  # end-of-line comment, not checking for strings
    if line and line[-1] == ':' and line[0] != ' ':
        split_key = line
        break

merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
buf = yaml.compat.StringIO()
yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
document = split_key + buf.getvalue().split('\n' + split_key)[1]
sys.stdout.write(document)
which gives:
users:
  xxxx1:
    <<: *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
  xxxx2:
    <<: *userdefaults
    timestamp: '2018-10-23 09:59:13.829978'
I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14.
That version is from the time I was still young (I won't claim to have been innocent).
There have been over 85 releases of the library since then.
I can understand that you might not be able to run anything but
Python 2 at the moment and cannot compile/use a newer version. But what
you really should do is install virtualenv (can be done using EPEL, but also without
further "polluting" your system installation), make a virtualenv for the
code you are developing, and install the latest version of ruamel.yaml (and
your other libraries) in there. You can also do that if you need
to distribute your software to other systems; just install virtualenv there as well.
I have all my utilities under /opt/util, and manage them with
virtualenvutils, a wrapper around virtualenv.
For writing the user part, you will have to manually split the multi-file output of yaml.dump() and write the appropriate part back to the users YAML file.
import datetime
import StringIO
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='rt')

data = None
with open('defaults.yaml', 'r') as defaults:
    with open('users.yaml', 'r') as users:
        raw = "{}\n{}".format(''.join(defaults.readlines()), ''.join(users.readlines()))
        data = list(yaml.load_all(raw))

data[0]['users']['xxxx1']['timestamp'] = datetime.datetime.now().isoformat()

with open('users.yaml', 'w') as outfile:
    sio = StringIO.StringIO()
    yaml.dump(data[0], sio)
    out = sio.getvalue()
    outfile.write(out.split('\n\n')[1])  # write the second part, as this is the contents of users.yaml

How to make secure connection with python?

I'm using Python 3. I need to use certificate files to make a secure connection.
In this case, I used the HTTPSConnection class from http.client...
This class takes the paths of the cert files and uses them, like this:
import http.client

client = http.client.HTTPSConnection(
    "epp.nic.ir", key_file="filepath\\nic.pem", cert_file="filepath\\nic.crt")
As you see, this class takes file paths and works correctly.
But I need to pass the contents of these files instead, because I want to put the contents of the .crt and .pem files into a DB; the reason is that the file paths may change...
So I tried this:
import http.client
import base64

cert = b'''
content of cert file
'''
pem = b'''
content of pem file
'''
client = http.client.HTTPSConnection("epp.nic.ir", pem, cert)
As expected, I got this error:
TypeError: certfile should be a valid filesystem path
Is there any way to make this class accept the contents of the files instead of file paths?
Or is it possible to change the source code of http for this purpose?
It is possible to modify the Python source code, but it is not recommended, as it definitely brings portability, maintainability and other issues.
Consider that you want to update your Python version: you would have to apply your modification each time you update.
Consider that you want to run your code on another machine: again, the same problem.
Instead of modifying the source code, there is a better and preferable way: extending the API.
You can subclass the existing HTTPSConnection class and override its constructor with your own implementation.
There are plenty of ways to achieve what you need.
Here is a possible solution with subclassing:
import http.client
import os
import tempfile


class MyHTTPSConnection(http.client.HTTPSConnection):
    """HTTPSConnection with key and cert passed as contents rather than file names"""

    def __init__(self, host, key_content=None, cert_content=None, **kwargs):
        # the additional parameters are optional, so that this class can be
        # used with or without cert/key contents, as a replacement for the
        # standard HTTPSConnection
        self.key_file = None
        self.cert_file = None
        # here we write the content of the key into a temporary file;
        # delete=False keeps the file in the file system,
        # but we then need to remove it manually when we are done
        if key_content:
            self.key_file = tempfile.NamedTemporaryFile(delete=False)
            self.key_file.write(key_content)
            self.key_file.close()
            # the NamedTemporaryFile object provides a 'name' attribute,
            # which is a valid file name in the file system,
            # so we can use those file names to initiate the actual HTTPSConnection
            kwargs['key_file'] = self.key_file.name
        # same as above, but this time for the cert content and cert file
        if cert_content:
            self.cert_file = tempfile.NamedTemporaryFile(delete=False)
            self.cert_file.write(cert_content)
            self.cert_file.close()
            kwargs['cert_file'] = self.cert_file.name
        # initialize the superclass with host and keyword arguments
        super().__init__(host, **kwargs)

    def clean(self):
        # remove the temp files from the file system;
        # you need to decide when to call this method
        if self.key_file:
            os.unlink(self.key_file.name)
        if self.cert_file:
            os.unlink(self.cert_file.name)


host = "epp.nic.ir"
key_content = b'''content of key file'''
cert_content = b'''content of cert file'''
client = MyHTTPSConnection(host, key_content=key_content, cert_content=cert_content)
# ...

Filter protocols in python

I'm trying to filter packets of certain protocols out of a given pcap file, using user input, and then move those packets to a new pcap file.
This is the code I have made so far:
# ===================================================
# Imports
# ===================================================
import os
import sys

from scapy.all import *
from scapy.utils import PcapWriter
"""
you're going to need to install the modules below
"""
from Tkinter import Tk
from tkFileDialog import askopenfilename

# ===================================================
# Constants
# ===================================================
# OS commands:
# ~~~~~~~~~~~~~
if "linux2" in sys.platform:
    """
    linux based system clear command
    """
    CLEAR_COMMAND = "clear"
elif "win32" in sys.platform:
    """
    windows based system clear command
    """
    CLEAR_COMMAND = "cls"
elif "cygwin" in sys.platform:
    """
    cygwin based clear command
    """
    CLEAR_COMMAND = "printf \"\\033c\""
elif "darwin" in sys.platform:
    """
    mac OS X based clear command
    """
    CLEAR_COMMAND = "printf \'\\33c\\e[3J\'"

# Usage strings:
# ~~~~~~~~~~~~~~
FILE_STRING = "please choose a pcap file to use"
BROWSE_STRING = "press any key to browse files\n"
BAD_PATH_STRING = "bad file, please try again\n"
BAD_INPUT_STRING = "bad input, please try again\n"
PROTOCOL_STRING = "please enter the protocol you wish to filter\n"
NAME_STRING = "please enter the new pcap file name\n"

# ===================================================
# Code
# ===================================================
def filter_pcap():
    """
    filters the given pcap file for a protocol the user chooses (from any
    layer of the OSI model), then asks for a new pcap file name and writes
    the packets of that protocol to the new pcap file
    :param none
    :return nothing:
    """
    path = file_browse()
    filtertype = raw_input(PROTOCOL_STRING)
    name = raw_input(NAME_STRING)
    packs = rdpcap(path)
    for i in range(len(packs)):
        if filtertype in packs[i]:
            wrpcap(name + ".pcap", packs[i])


def file_browse():
    """
    Purpose: allows the user to browse files on his computer,
    then checks that the path is OK and returns it
    :returns - the path to the chosen pcap file
    """
    path = "test"
    while ".pcap" not in path:
        print FILE_STRING
        raw_input(BROWSE_STRING)
        os.system(CLEAR_COMMAND)
        Tk().withdraw()
        path = askopenfilename()
        if ".pcap" not in path:
            print BAD_PATH_STRING
    return path


filter_pcap()
Now the problem is that I'm failing to filter the packets correctly.
The code needs to filter protocols from any layer and of any kind.
I have checked this thread: How can I filter a pcap file by specific protocol using python?
But as you can see it was not answered, and the user added the problems I had in the edit; if anyone could help me it would be great.
Example of how it should work:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's say I use the file "sniff" as my first pcap file, and it has 489 packets, of which 200 are HTTP packets.
Now there is this prompt:
please enter the protocol you wish to filter
'http'
and then this prompt:
please enter the new pcap file name
'new'
The user input was 'http', so the program will search for every packet that runs on the HTTP protocol and will create a new pcap file called 'new.pcap'.
The file 'new.pcap' will contain the 200 HTTP packets.
Now, that should work with any protocol in the OSI model, including protocols like IP, TCP, Ethernet and so on (all the protocols in the OSI model).
I have found out that the Wireshark command line has the option -R and tshark has .protocols, but it doesn't really work... Can anyone check that?
Edit: I found pyshark, but I don't know how to write with it.
I don't believe that scapy has any functions or methods to support application layer protocols in the way that you are after. However, using sport and dport as a filter will do the trick (provided you are going to see/are expecting default ports).
Try something like this:
def filter_pcap(filtertype=None):
    # ..... snip .....
    # Static dict of protocol-to-default-port values, like so:
    protocols = {'http': 80, 'ssh': 22, 'telnet': 23}
    # Check that the requested protocol is in the dict
    while filtertype not in protocols:
        filtertype = raw_input(PROTOCOL_STRING)
    # Name for the output file
    name = raw_input(NAME_STRING)
    # Read the input file
    packs = rdpcap(path)
    # Restrict to TCP packets
    for packet in packs[TCP]:
        # Keep only the protocol (i.e. port) that we want
        if packet.sport == protocols[filtertype] or packet.dport == protocols[filtertype]:
            # Append to the output file
            wrpcap(name + ".pcap", packet, append=True)
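A compact alternative, as a sketch using scapy's PacketList.filter (the 'sniff.pcap' and 'new.pcap' names and the port-80 mapping for HTTP are assumptions taken from the question's example):
from scapy.all import rdpcap, wrpcap, TCP

# keep every TCP packet whose source or destination port is HTTP's default 80
packs = rdpcap('sniff.pcap')
http_packets = packs.filter(lambda p: TCP in p and 80 in (p[TCP].sport, p[TCP].dport))
wrpcap('new.pcap', http_packets)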

How to use pyfig module in python?

This is the part of the mailer.py script:
config = pyfig.Pyfig(config_file)
svnlook = config.general.svnlook #svnlook path
sendmail = config.general.sendmail #sendmail path
From = config.general.from_email #from email address
To = config.general.to_email #to email address
What does this config variable contain? Is there a way to get the values of the config variable without pyfig?
In this case config is a pyfig.Pyfig object initialised with the contents of the file named by the string config_file.
To find out what that object does and contains, you can either look at the documentation and/or the source code, both here, or you can print it out after the initialisation, e.g.:
config = pyfig.Pyfig(config_file)
print "Config Contains:\n\t", '\n\t'.join(dir(config))
if hasattr(config, "keys"):
    print "Config Keys:\n\t", '\n\t'.join(config.keys())
or if you are using Python 3,
config = pyfig.Pyfig(config_file)
print("Config Contains:\n\t", '\n\t'.join(dir(config)))
if hasattr(config, "keys"):
    print("Config Keys:\n\t", '\n\t'.join(config.keys()))
To get the same data without pyfig, you would need to read and parse the content of the file referenced by config_file within your own code.
N.B.: Note that pyfig seems to be more or less abandoned (no updates in over 5 years, web site no longer exists, etc.), so I would strongly recommend converting the code to use a JSON configuration file instead.
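For instance, a minimal sketch of such a conversion (my addition; the JSON layout below is an assumption that mirrors the general section used by mailer.py above):
import json

# hypothetical config.json:
# {"general": {"svnlook": "/usr/bin/svnlook", "sendmail": "/usr/sbin/sendmail",
#              "from_email": "a@example.com", "to_email": "b@example.com"}}
with open(config_file) as f:
    config = json.load(f)

svnlook = config["general"]["svnlook"]    # svnlook path
sendmail = config["general"]["sendmail"]  # sendmail path
From = config["general"]["from_email"]    # from email address
To = config["general"]["to_email"]        # to email address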

How to access header information sent to http.server from an AJAX client using python 3+?

I am trying to read data sent to Python's http.server from a local JavaScript program, posted with AJAX. Everything works in Python 2.7 as in this example, but now in Python 3+ I can't access the header anymore to get the file length.
# python 2.7 (works!)
class handler_class(SimpleHTTPServer.SimpleHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/data':
            length = int(self.headers.getheader('Content-Length'))
            NewData = self.rfile.read(length)
I've discovered I could use urllib.request, as I have mocked up below. However, I am running on localhost and don't have a full URL as in the examples I've seen, and I am starting to second-guess whether this is even the right way to go. Frustratingly, I can see the Content-Length printed out in the console, but I can't access it.
# python 3+ (url?)
import urllib.request

class handler_class(http.server.SimpleHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/data':
            print(self.headers)  # I can see the content length here but cannot access it!
            d = urllib.request.urlopen(url)  # what url?
            length = int(d.getheader('Content-Length'))
            NewData = self.rfile.read(length)
The various URLs I have tried are:
self.path
http://localhost:8000/data
/data
and I generally get this error:
ValueError: unknown url type: '/data'
So why is urllib.request failing me, and more importantly, how does one access self.headers in this Python 3 world?
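For reference, a minimal sketch of the Python 3 way (assuming the same /data route as above): in Python 3, self.headers is an email.message.Message, so the Python 2 getheader() method is gone, but the value can be read with dict-style access, and no urllib.request call is involved:
import http.server

class handler_class(http.server.SimpleHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/data':
            # email.message.Message supports subscripting by header name
            length = int(self.headers['Content-Length'])
            NewData = self.rfile.read(length)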
