How to convert an XML tree object to a bytes stream in Python?

I have a function that saves files to a db, but this one requires a bytes stream as parameter. Something like:
write_to_db("File name", stream_obj)
Now, I want to save an XML; I am using the xml library.
import xml.etree.cElementTree as ET
Is there a function that converts the xml object to a bytes stream?
The solution I got was:
1. Save it locally with the write function
2. Retrieve it with "rb" to get the file as bytes
3. Now that I have the bytes stream, save it with the function mentioned above
4. Delete the local file
Example:
import os

# Saving the XML as a local file
tree = ET.ElementTree(ET.Element("Example"))
tree.write("/This/is/a/path.xml")
# Reading the local file back as bytes
f = open("/This/is/a/path.xml", "rb")
# Saving to the DB
write_to_db("File name", f)  # <--- passing "f", which was opened in binary mode with "rb"
f.close()
# Deleting the local file
os.remove("/This/is/a/path.xml")
But is there a function in the xml library that returns the bytes stream directly? Something like:
tree = ET.ElementTree(ET.Element("Example"))
bytes_file = tree.get_bytes()  # <-- Like this?
# Writing to db
write_to_db("File name", bytes_file)
That way I can avoid creating and then removing a file on disk.
Thank you in advance.
One more quick question:
Is "bytes stream" the correct term? If not, what is the difference, and what would be the correct term for what I am looking for?

So as Balmy mentioned in the comments, the solution is using:
ET.tostring()
My code at the end looked something like this:
# Here you build your xml
x = ET.Element("ExampleXML", {"a_tag": "1", "another_tag": "2"})
# Here I save it to my db using the "tostring" function,
# which by default returns the serialized XML as bytes.
write_to_db("File name", ET.tostring(x))
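If write_to_db expects a file-like object rather than raw bytes, the tree can also be written straight into an in-memory buffer instead of a file on disk. A minimal sketch using only the standard library (write_to_db is the hypothetical function from the question, so it is left out here):

```python
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.Element("Example"))

# Serialize the whole tree into an in-memory buffer; tree.write accepts
# any file-like object, not just a filename
buf = io.BytesIO()
tree.write(buf, encoding="utf-8", xml_declaration=True)

buf.seek(0)        # rewind so the consumer reads from the start
data = buf.read()  # the serialized XML as bytes
```

Passing buf (after the seek) would cover the file-like case, and ET.tostring(tree.getroot()) covers the raw-bytes case.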

Related

Writing a Python pdfrw PdfReader object to an array of bytes / filestream

I'm currently working on a simple proof of concept for a pdf-editor application. The example is supposed to be a simplified python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.
So, here's the issue. I'm not interested in writing the edited PDF to a file.
The idea is that file opening and closing is going to most likely be handled by external code and so I want all the edits in my files to be done in memory. I don't want to write the edited filestream to a local file.
Let me specify what I mean by this. I currently have a piece of code like this:
class FormFiller:
    def __fill_pdf__(input_pdf_filestream: bytes, data_dict: dict):
        template_pdf: pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf

    def fillForm(self, mapper: FieldMapper):
        value_mapping: dict = mapper.getValues()
        filled_pdf: pdfrw.PdfReader = self.__fill_pdf__(self.filestream, value_mapping)
        # <this point is crucial>

    def __init__(self, filestream: bytes):
        self.filestream: bytes = filestream
So, as you see the FormFiller constructor receives an array of bytes. In fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will), and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not allow for such an operation. It only provides a write(<filename>) method, which saves the PdfReader object to a pdf output file.
How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?
Please help :-(
To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:
from io import BytesIO
new_bytes_object = BytesIO()
Then, use pdfrw's PdfWriter.write() method to write your data to the empty BytesIO object:
pdfrw.PdfWriter().write(new_bytes_object, filled_pdf)
# I'm not sure about the syntax, I haven't used this lib before
This works because io.BytesIO objects act like file objects, also known as file-like objects. It and related classes such as io.StringIO behave like files held in memory, just like the object f created with the built-in function open() below:
with open("output.txt", "a") as f:
    f.write(some_data)
Before you attempt to read from new_bytes_object, don't forget to seek(0) back to the beginning, or rewind it. Otherwise, the object seems empty.
new_bytes_object.seek(0)
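A minimal, library-independent sketch of that write/rewind/read cycle (the bytes written are a made-up stand-in for what PdfWriter would produce):

```python
from io import BytesIO

buf = BytesIO()
buf.write(b"%PDF-1.4 ...")   # stand-in for PdfWriter writing into the buffer

# The stream position is now at the end, so read() returns nothing
assert buf.read() == b""

buf.seek(0)                  # rewind to the beginning
assert buf.read() == b"%PDF-1.4 ..."
```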

xhtml2pdf: Output generated PDF as in-memory object (its bytes)

I'm working with Python 3, Django and the xhtml2pdf package.
I want to create a PDF from an HTML string, but I don't want to write the PDF on disk, but rather just to get its bytes from memory, as in using BytesIO or StringIO.
I've read the xhtml2pdf documentation. This is the closest I've found related to what I need:
In-memory files can be generated by using StringIO or cStringIO instead of the file open. Advanced options will be discussed later in this document.
And this is one of the latest things I've tried:
def html_to_pdf(html):
    """Writes a PDF file using xhtml2pdf from a given HTML stream

    Parameters
    ----------
    html : str
        A valid HTML string.

    Returns
    -------
    bytes
        A bytes sequence containing the rendered PDF.
    """
    output = BytesIO()
    pisa_status = pisa.CreatePDF(html, dest=output)
    return new_output.read()
But this isn't working.
Any idea how to output the generated PDF as an in-memory object and thus return its bytes?
I think your return statement is using new_output instead of output.
However, the real issue may be something else: have you tried calling output.seek(0) before reading its bytes with output.read()?
What you can also do is call output.getvalue(), which returns the entire contents of the BytesIO object regardless of the current stream position.
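A quick illustration of the difference, using a plain io.BytesIO with made-up bytes (no xhtml2pdf required):

```python
from io import BytesIO

output = BytesIO()
output.write(b"fake pdf bytes")

# read() honors the current position, which is at the end after writing
assert output.read() == b""

# getvalue() always returns the full contents, regardless of position
assert output.getvalue() == b"fake pdf bytes"

# read() works too, but only after rewinding
output.seek(0)
assert output.read() == b"fake pdf bytes"
```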

Power BI (PBIX) - Parsing the Layout file

I am trying to document the Reports, Visuals and Measures used in a PBIX file. I have a PBIX file (containing some visuals and pointing to a Tabular Model in Live Mode). I exported it as a PBIT and renamed it to .zip. Inside this zip file there is a folder called Report, and within that a file called Layout. The Layout file looks like a JSON file, but when I try to read it via Python,
import json

# Opening JSON file
f = open("C://Layout",)
# json.load returns the JSON object as a dictionary
#f1 = str.replace("\'", "\"")
data = json.load(f)
I get the issue below:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Renaming it to Layout.json doesn't help either and gives the same issue. Is there an easy way, or a parser, to specifically parse this Layout file and get the information below out of it?
Report Name | Visual Name | Column or Measure Name
Not sure if you have come across an answer to your question yet, but I have been looking into something similar.
Here is what I had to do in order to get the file to parse correctly.
The big things to note here are the encoding and all the control-character replacements.
data will then contain the parsed object.
with open('path/to/Layout', 'r', encoding="cp1252") as json_file:
    data_str = json_file.read().replace(chr(0), "").replace(chr(28), "").replace(chr(29), "").replace(chr(25), "")
    data = json.loads(data_str)
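To see why those replacements matter: json.loads rejects raw control characters inside strings, so stripping them first lets the document parse. A minimal sketch with a made-up payload:

```python
import json

# A JSON string polluted with a raw NUL character, as extracted binary often is
raw = '{"reportName": "Sales\x00Dashboard"}'

failed = False
try:
    json.loads(raw)
except json.JSONDecodeError:
    failed = True  # raw control characters make the parse fail

# Stripping the control character first lets it parse
cleaned = raw.replace(chr(0), "")
data = json.loads(cleaned)
```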
This script may help: https://github.com/grenzi/powerbi-model-utilization
a portion of the script is:
def get_layout_from_pbix(pbixpath):
    """
    get_layout_from_pbix loads a pbix file, grabs the layout from it, and returns json
    :parameter pbixpath: file to read
    :return: json goodness
    """
    archive = zipfile.ZipFile(pbixpath, 'r')
    bytes_read = archive.read('Report/Layout')
    s = bytes_read.decode('utf-16-le')
    json_obj = json.loads(s, object_hook=parse_pbix_embedded_json)
    return json_obj
I had a similar issue.
My workaround was to save it as Layout.txt with UTF-8 encoding, then continue as you have.

Python: Converting Entire Directory of JSON to Python Dictionaries to send to MongoDB

I'm relatively new to Python, and extremely new to MongoDB (as such, I'll only be concerned with taking the text files and converting them). I'm currently trying to take a bunch of .txt files containing JSON and move them into MongoDB. My approach is to open each file in the directory, read each line, convert it from JSON to a dictionary, and then overwrite that JSON line with the dictionary. Then it'll be in a format I can send to MongoDB.
(If there's any flaw in my reasoning, please point it out)
At the moment, I've written this:
"""
Kalil's step by step iteration / write.
JSON dumps takes a python object and serializes it to JSON.
Loads takes a JSON string and turns it into a python dictionary.
So we return json.loads so that we can take that JSON string from the tweet and save it as a dictionary for Pymongo
"""
import os
import json
import pymongo
rootdir='~/Tweets'
def convert(line):
line = file.readline()
d = json.loads(lines)
return d
for subdir, dirs, files in os.walk(rootdir):
for file in files:
f=open(file, 'r')
lines = f.readlines()
f.close()
f=open(file, 'w')
for line in lines:
newline = convert(line)
f.write(newline)
f.close()
But it isn't writing.
Which... As a rule of thumb, if you're not getting the effect that you're wanting, you're making a mistake somewhere.
Does anyone have any suggestions?
When you decode a json file you don't need to convert line by line as the parser will iterate over the file for you (that is unless you have one json document per line).
Once you've loaded the json document you'll have a dictionary which is a data structure and cannot be directly written back to file without first serializing it into a certain format such as json, yaml or many others (the format mongodb uses is called bson but your driver will handle the encoding for you).
The overall process to load a json file and dump it into mongo is actually pretty simple and looks something like this:
import json
import os
from glob import glob
from pymongo import MongoClient  # pymongo's old Connection class has been replaced by MongoClient

db = MongoClient().test

for filename in glob(os.path.expanduser('~/Tweets/*.txt')):  # glob does not expand "~" by itself
    with open(filename) as fp:
        doc = json.load(fp)
    db.tweets.insert_one(doc)
A dictionary in Python is an object that lives within the program; you can't save the dictionary directly to a file unless you pickle it (pickling is a way to save objects to files so you can retrieve them later). I think a better approach is to read the lines from the file, parse the JSON (which converts it to a dictionary), and save that info into MongoDB right away; there is no need to write it back to a file.
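If the files do hold one JSON document per line (common for tweet dumps), the per-line variant looks like this. A sketch using a plain list as a stand-in for the MongoDB collection, with a hypothetical helper name:

```python
import json

def load_json_lines(path):
    """Parse a file with one JSON document per line into a list of dicts."""
    docs = []
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if line:  # skip blank lines
                docs.append(json.loads(line))
    return docs

# In real code each dict would go to MongoDB via the pymongo driver,
# e.g. db.tweets.insert_many(docs), instead of being kept in a list.
```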

generate xml with sax2 in python

I have a data model, i.e. an object of a class, and I need to initialize it by reading from an XML file, or create the object from scratch and output it to an XML file. Previously, I simply used string operations in Python to read XML (file.read + string.find) and write XML (file.write), without error checking.
Now I am thinking of using SAX2 for this. I know how to do it for reading, but I am not very clear about writing. It looks like SAX2 is meant for the case where there is an original XML document and you want to output it after certain modifications. In my case I want to output my data model to XML, with no original XML at all. I wonder whether SAX2 is suitable for this or whether I should keep using my old way. What is the better way to input/output a class object from/to XML with Python? The class is very simple (just a list collection of list information, i.e., root -> children -> grandchildren) and small in size.
Thanks for any suggestions.
Try the Pythonic XML processing way: ElementTree.
Generating XML output is easy with `xml.etree.ElementTree.ElementTree.write()`:
write(file, encoding="us-ascii", xml_declaration=None, method="xml")
Writes the element tree to a file, as XML. file is a file name, or a file object opened for writing. encoding is the output encoding (default is US-ASCII). xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None). method is either "xml", "html" or "text" (default is "xml").
Example loading ElementTree object from text file:
>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
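For the write direction the question asks about, building a small tree from scratch and serializing it needs no SAX at all. A sketch matching the root -> children -> grandchildren shape described above (element names are made up):

```python
import io
import xml.etree.ElementTree as ET

# Build the tree from scratch: root -> children -> grandchildren
root = ET.Element("collection")
child = ET.SubElement(root, "list", {"name": "first"})
ET.SubElement(child, "item").text = "value"

tree = ET.ElementTree(root)

# Write to any file or file-like object; here, an in-memory buffer
buf = io.BytesIO()
tree.write(buf, encoding="utf-8", xml_declaration=True)
# tree.write("model.xml", encoding="utf-8") would write to disk instead
```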
