Python FileStorage: Read XML file

Below is a simple example that writes an XML file and reads it back. The writing works fine, but I am not sure how to read the file back. Here is some sample code. How do I get these values from the XML file?
file1 = 'result1.xml'
fs = cv2.FileStorage(file1, cv2.FILE_STORAGE_WRITE)
fs.write('var1', 1)
fs.write('var2', 2)
fs = cv2.FileStorage(file1, cv2.FILE_STORAGE_READ)
fn = fs.real

Python ships with its own standard library for parsing XML data.
Here is where you can find the documentation: XML Library
Be careful when using it: as the warning on that page says, these modules are not secure against maliciously constructed XML.
Here is another useful page: How to parse XML files using Python?
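For the file written above, a minimal sketch with the standard library's ElementTree could look like this. It assumes OpenCV's default layout, where FileStorage puts the values under an <opencv_storage> root element (check the generated result1.xml to confirm):
import xml.etree.ElementTree as ET

# result1.xml written by cv2.FileStorage looks roughly like:
# <opencv_storage><var1>1.</var1><var2>2.</var2></opencv_storage>
tree = ET.parse('result1.xml')
root = tree.getroot()
var1 = float(root.find('var1').text)
var2 = float(root.find('var2').text)
print(var1, var2)
Alternatively, cv2.FileStorage itself can read the values back: with the file opened in cv2.FILE_STORAGE_READ mode, fs.getNode('var1').real() returns the stored number.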

Related

Python: Unsupported format, or corrupt file

I am trying to make a Python program that downloads an XLS file from a website, in this case https://www.blackrock.com/uk/individual/products/291392/, and loads it as a dataframe in pandas with the correct data structure.
The issue is that when I try to load it via pandas, it gives me an error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'
I am not quite sure what is causing this error, but presumably it is something with the file. I can open the file in Excel, even though I get a warning that the file and the file extension do not match, and that the file might be dangerous, etc. If I click yes to open it anyway, it opens up with the data displayed correctly. If I use Excel to save the file as .xlsx I can open it in pandas, but I would rather have a solution that doesn't require manually opening Excel and saving the file.
I have tried renaming the file extension to .xlsx, but this does not work, as the file won't open with that extension.
I have tried many different extensions, but none of them work, unfortunately.
I am at a loss.
I hope you can help.
EDIT: The code I use is:
download_path = 'https://www.blackrock.com/uk/individual/products/291392/fund/1527484370694.ajax?fileType=xls&fileName=iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund&dataType=fund'
testing = pd.read_excel(download_path, engine='xlrd', sheet_name = 'Holdings', skiprows = 3)
The actual problem is that the file format is SpreadsheetML, which was only used briefly, between 2003 and 2006, before being superseded by the XLSX format. Since it was around only for a short time and long ago, most packages do not support loading or saving it. More about the format can be found here: https://learn.microsoft.com/en-us/previous-versions/office/developer/office-xp/aa140066(v=office.10)?redirectedfrom=MSDN
For this reason, pandas will not be able to load it properly, and a generic XML parser (e.g. ElementTree) will not give you a spreadsheet out of the box either; regular MS Office software will still open it correctly. As far as I know, you can deal with SpreadsheetML files using the aspose-cells package: https://products.aspose.com/cells/python-java/
For your case:
# Import packages
import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook, FileFormatType
from asposecells.api import HtmlSaveOptions
# Read Workbook
workbook = Workbook('iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls')
worksheet = workbook.getWorksheets().get(0)
# Accessing a cell using its name
cells = worksheet.getCells()
cell = cells.get("A1")
# Print Message
print("Cell Value: " + str(cell.getValue())) # Prints Cell Value: 17-Nov-2021
# To save SpreadSheetML in different format (HTML)
saveOptions = HtmlSaveOptions()
saveOptions.setDisableDownlevelRevealedComments(True)
workbook.save("iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.html", saveOptions)
As mentioned by Slybot, this is not a real xls file.
If you inspect the contents in a plain text editor, or a hex editor, the header starts:
<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
which confirms this is an xml document, and not an Office 2007 zipped xlsx office document.
Your next steps depend on whether you have Excel installed on the machine that will be running this code or not, and if not, what other libraries you have access to and are willing to pay for - Slybot has mentioned aspose for example.
The easiest solution - Excel
If you are running this on a Windows machine with Excel installed, you have the free and capable option of automating Excel to open the file and save it as xlsx. This is done with the win32com module, described in this answer:
Attempting to Parse an XLS (XML) File Using Python
Alternatively, save your Excel-styled XML as xlsx with the Workbook.SaveAs method via win32com (Windows only) and read it in with pandas.read_excel, skipping the appropriate rows.
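A minimal sketch of that conversion, assuming Excel is installed and the downloaded file has been saved locally (the file paths here are placeholders):
import win32com.client
import pandas as pd

# Let Excel open the SpreadsheetML file and re-save it as a real .xlsx workbook.
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(r"C:\data\iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls")
wb.SaveAs(r"C:\data\fund.xlsx", FileFormat=51)  # 51 = xlOpenXMLWorkbook (.xlsx)
wb.Close()
excel.Quit()

# Now pandas can read the converted file as usual.
df = pd.read_excel(r"C:\data\fund.xlsx", sheet_name="Holdings", skiprows=3)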
The XML solution
You could read in the raw XML and digest it yourself; a small sketch follows the node list below. The relevant nodes are:
<ss:Workbook>
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data>
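A minimal ElementTree sketch along those lines, assuming the downloaded file has been saved locally and uses the ss namespace shown above:
import xml.etree.ElementTree as ET

NS = {'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

tree = ET.parse('iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls')  # XML despite the .xls name
root = tree.getroot()

# Walk the rows of the "Holdings" worksheet and collect the cell values.
for worksheet in root.findall('ss:Worksheet', NS):
    if worksheet.get('{urn:schemas-microsoft-com:office:spreadsheet}Name') == 'Holdings':
        for row in worksheet.findall('ss:Table/ss:Row', NS):
            values = [data.text for data in row.findall('ss:Cell/ss:Data', NS)]
            print(values)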
The Third-party library solution
I am not familiar with any libraries which provide this functionality, and can't advise on this option.

flask - python - markdown to html

I'm building my first website using Flask and HTML. Some of the data that I want to migrate to this website is in Markdown format. I am trying to convert Markdown into HTML using this library, however I cannot get my head around it:
https://github.com/Python-Markdown/markdown
I import it into my *.py file, but I'm not sure what the next steps are after that. This is what I have so far:
import markdown
html = markdown.markdown(text)
I'm not sure what should be put into the "text" variable. Also, my Markdown data resides in an HTML file; how do I reference that from here? I have read through the installation guide, but it isn't very clear to me.
Thank you.
According to the docs located at https://python-markdown.github.io/reference/#using-markdown-as-a-python-library
text is supposed to contain your markdown text. In the below example found in the docs, some_file.txt would be the file containing your markdown.
import codecs
import markdown

input_file = codecs.open("some_file.txt", mode="r", encoding="utf-8")
text = input_file.read()
html = markdown.markdown(text)
To get your text, you would need to parse it out of the HTML. There are several ways of doing this, but we would need more information about the file to proceed. Is your HTML file stored locally? Where in the file is the Markdown? An MRE (minimal reproducible example) would be helpful.
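If the Markdown ends up in a plain local file, a minimal Flask sketch for serving the converted HTML could look like this (the route and file name are made up for illustration):
import markdown
from flask import Flask

app = Flask(__name__)

@app.route("/notes")
def notes():
    # Read the Markdown source from a local file (hypothetical path).
    with open("notes.md", encoding="utf-8") as f:
        text = f.read()
    # Convert it to an HTML fragment and return it as the response body.
    return markdown.markdown(text)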

Put an XML file inside a Python script?

I'm trying to create a face-detection script with Python's OpenCV, using the Haar cascade XML file.
My goal is to upload a python file to a website but due to some weird policies, I can only upload the Python file, without the XML...
The question is, is it possible to somehow put the XML file inside the Python script, say, convert it to a String or something and then generate an XML from that String?
xml = """<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>Yes, you can embed XML in a string literal in Python.</b>
</a>"""
This is not an answer to the title question, but it answers the question in your description.
cv2.CascadeClassifier() only loads a Haar cascade from a file path; it does not accept an XML string. Also, if you host the XML file on a website and pass that link to cv2.CascadeClassifier(), it will give an error.
But you can use the requests module in Python to achieve what you want.
It gets the XML from the website, then writes it into a file:
import requests
import cv2 as cv

def function(self, image):
    # download the cascade XML from the server
    link = LINK_TO_XML
    r = requests.get(link, allow_redirects=True)
    open('haarcascade_frontalface_default.xml', 'wb').write(r.content)
    # end of download
    haar_cascade = cv.CascadeClassifier('haarcascade_frontalface_default.xml')
First, copy the contents of the XML file into the Python file and assign the whole thing to a string. Then use the xml.etree.ElementTree library to build a tree data structure named root that contains the contents of the XML file. This tree is traversable, and you can do what you like with it in your program:
import xml.etree.ElementTree as ET
root = ET.fromstring(XML_file_example_as_string)
To generate XML from the string you can use ElementTree.write() like this:
tree = ET.ElementTree(root)
tree.write('example.xml')
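Combining the two ideas above for the cascade use case: since cv2.CascadeClassifier only loads from a path, you can embed the cascade XML as a string and write it to a temporary file before loading it. A minimal sketch, where CASCADE_XML is a placeholder for the pasted file contents:
import tempfile
import cv2

# CASCADE_XML is assumed to hold the complete text of the Haar cascade XML,
# pasted into the script as one (very long) string literal.
CASCADE_XML = """<?xml version="1.0"?> ..."""  # placeholder, not a real cascade

# Write the embedded XML to a temporary file so CascadeClassifier can load it.
with tempfile.NamedTemporaryFile(mode="w", suffix=".xml", delete=False) as tmp:
    tmp.write(CASCADE_XML)
    tmp_path = tmp.name

face_cascade = cv2.CascadeClassifier(tmp_path)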

Reading 1000s of XML documents with BeautifulSoup

I'm trying to read a bunch of XML files and do stuff to them. The first thing I want to do is rename them based on a number that's inside each file.
You can see a sample of the data here (warning: this will initiate a download of a 108 MB zip file!). That's a huge XML file with thousands of smaller XML files inside it. I've broken those out into individual files. I want to rename the files based on a number inside each one (part of preprocessing). I have the following code:
from __future__ import print_function
from bs4 import BeautifulSoup  # To get everything
import os

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]
    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        f = open(full_filename, "r")
        xml = f.read()
        soup = BeautifulSoup(xml)
        del xml
        del soup
        f.close()
If I comment out the "soup =" and "del" lines, it works perfectly. If I add the "soup = ..." line, it will work for a moment and then eventually crap out - it just crashes the Python kernel. I'm using Enthought Canopy, but I've tried running it from the command line and it craps out there, too.
I thought, perhaps, it was not deallocating the space for the variable "soup" so I tried adding the "del" commands. Same problem.
Any thoughts on how to circumvent this? I'm not stuck on BS. If there's a better way of doing this, I would love it, but I need a little sample code.
Try using cElementTree.parse() from Python's standard xml library instead of BeautifulSoup. 'Soup is great for parsing normal web pages, but cElementTree is blazing fast.
Like this:
import xml.etree.cElementTree as cET
import os

# ...

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]
    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        parsed = cET.parse(full_filename)
        del parsed
If your XML is formatted correctly, this should parse it. If your machine is still unable to handle all that data in memory, you should look into streaming the XML.
I would not separate that file into many small files and then process them further; I would process it all in one go.
I would just use a streaming-API XML parser on the master file, get the name, and write out each sub-file once with the correct name.
There is no need for BeautifulSoup, which is primarily designed to handle HTML and uses a document model instead of a streaming parser.
There is no need to build an entire DOM all at once just to get a single element.
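A minimal sketch of that streaming approach with the standard library's iterparse; the tag names ('doc', 'doc-number') are made up, so substitute the real ones from the master file:
import xml.etree.ElementTree as ET

def split_master_file(master_file):
    # Stream the master file element by element instead of building a full DOM.
    for event, elem in ET.iterparse(master_file, events=("end",)):
        if elem.tag == "doc":                     # hypothetical sub-document tag
            number = elem.findtext("doc-number")  # hypothetical element holding the number
            with open(number + ".xml", "wb") as out:
                out.write(ET.tostring(elem))
            elem.clear()  # release the processed sub-tree to keep memory flat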

pdf file with python

How can I search for a word or a line in a PDF file?
Is there an existing module that does this concisely?
Thank you in advance,
There's something called pyPDF.
It is a pure-Python library built as a PDF toolkit.
You can extract text (using the extractText() method) and also search the PDF file using something like the following code:
import pyPdf

pdf = pyPdf.PdfFileReader(open(path, "rb"))
content = pdf.getPage(1).extractText()
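To actually search for a word, you could loop over the pages and do a plain substring check on the extracted text; a sketch along those lines (the search term is a placeholder):
import pyPdf

pdf = pyPdf.PdfFileReader(open(path, "rb"))
for page_number in range(pdf.getNumPages()):
    text = pdf.getPage(page_number).extractText()
    if "search term" in text:
        print("Found on page", page_number)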
