How to use lxml iterparse from Azure StorageStreamDownloader? - python

I'm currently using lxml.etree.iterparse to iterate over an XML file tag by tag. Locally this works fine, but I want to move the XML file to Azure Blob Storage and process it in an Azure Function. However, I'm a bit stuck on how to parse the XML file from the StorageStreamDownloader.
Code locally
from lxml import etree

context = etree.iterparse('c:\\Users\\', tag='InstanceElement')
for event, elem in context:
    # processing of the tag
Streaming from Blob
from lxml import etree
from azure.storage.filedatalake import DataLakeServiceClient

connect_str = ''
service = DataLakeServiceClient.from_connection_string(conn_str=connect_str)
System = service.get_file_system_client('')
FileClient = System.get_file_client('')
Stream = FileClient.download_file()
# Stuck on what the input must be for iterparse
context = etree.iterparse(, tag='InstanceElement')
for event, elem in context:
    # processing of the tag
I'm stuck on what the input to iterparse must be, so any ideas on how to parse the XML file while streaming it?

Try this:
from lxml import etree
from azure.storage.filedatalake import DataLakeServiceClient
from io import BytesIO

connect_str = ''
service = DataLakeServiceClient.from_connection_string(conn_str=connect_str)
System = service.get_file_system_client('')
FileClient = System.get_file_client('test.xml')
content = FileClient.download_file().readall()
context = etree.iterparse(BytesIO(content), tag='InstanceElement')
for event, elem in context:
    print(elem.text)
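One caveat: readall() buffers the whole blob in memory, which defeats the purpose of iterparse for very large files. A more incremental option is to feed downloaded chunks into an XMLPullParser as they arrive. This is only a sketch: it assumes the downloader exposes a chunks() iterator (recent azure-storage SDKs do), and the chunk source below is simulated with an in-memory list so the example is self-contained. The stdlib XMLPullParser is used here; lxml.etree provides a class with the same interface.

```python
from xml.etree.ElementTree import XMLPullParser  # lxml.etree has the same class


def iter_tags(chunks, tag):
    """Yield completed elements with the given tag as byte chunks arrive."""
    parser = XMLPullParser(events=('end',))
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem


# Simulated download; in the Azure case this would be
# FileClient.download_file().chunks()
xml_bytes = (b'<Root><InstanceElement>a</InstanceElement>'
             b'<InstanceElement>b</InstanceElement></Root>')
fake_chunks = [xml_bytes[i:i + 16] for i in range(0, len(xml_bytes), 16)]

texts = [elem.text for elem in iter_tags(fake_chunks, 'InstanceElement')]
print(texts)  # ['a', 'b']
```

With this pattern only the elements still being built are held in memory, instead of the whole file.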

Related

Upload a modified XML file to Google Cloud Storage after editing it with ElementTree (python)

I've modified a piece of code for merging two or more XML files into one. I got it working locally without using or storing files on Google Cloud Storage.
I'd like to use it via Cloud Functions, which seems to work mostly fine, apart from uploading the final XML file to Google Cloud Storage.
import os
import wget
import logging
from io import BytesIO
from google.cloud import storage
from xml.etree import ElementTree as ET

def merge(event, context):
    client = storage.Client()
    bucket = client.get_bucket('mybucket')
    test1 = bucket.blob("xml-file1.xml")
    inputxml1 = test1.download_as_string()
    root1 = ET.fromstring(inputxml1)
    test2 = bucket.blob("xml-file2.xml")
    inputxml2 = test2.download_as_string()
    root2 = ET.fromstring(inputxml2)
    copy_files = [e for e in root1.findall('./SHOPITEM')]
    src_files = set([e.find('./SHOPITEM') for e in copy_files])
    copy_files.extend([e for e in root2.findall('./SHOPITEM') if e.find('./CODE').text not in src_files])
    files = ET.Element('SHOP')
    files.extend(copy_files)
    blob = bucket.blob("test.xml")
    blob.upload_from_string(files)
I've tried the .write and .tostring functions, but without success.
Sorry for the incomplete question. I've already found a solution, and I can't recall the error message I got.
Here is my solution:
blob.upload_from_string(ET.tostring(files, encoding='UTF-8',xml_declaration=True, method='xml').decode('UTF-8'),content_type='application/xml')
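The key point is that upload_from_string expects a str or bytes, not an Element; ET.tostring does the serialization. A minimal local sketch of just the serialization step (note that the xml_declaration keyword of ET.tostring requires Python 3.8+):

```python
from xml.etree import ElementTree as ET

# Build a tiny tree like the merged <SHOP> document above
files = ET.Element('SHOP')
item = ET.SubElement(files, 'SHOPITEM')
ET.SubElement(item, 'CODE').text = 'A1'

# Serialize to a str with an XML declaration (Python 3.8+ for xml_declaration)
payload = ET.tostring(files, encoding='UTF-8',
                      xml_declaration=True, method='xml').decode('UTF-8')
print(payload)
```

The resulting string is what gets handed to upload_from_string, along with content_type='application/xml'.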

pdfminer error message: pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed

I need to process some PDF files and add their form field contents in a database.
This document has no Security Method set, as I can see in the PDF viewer's document properties.
I tried the suggestions I found here.
When I test with pdfminer (or pdfminer.six), I don't get an error message, but it doesn't retrieve any fields.
Using PyPDF2, I get the error message: "file has not been decrypted."
This is the pdfminer code:
import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

fname = r'D:\Atrium\Projects\CTFC\psgf\database\19022021\formulari-dinamic-redaccio-plans-simples-gestio-forestal_Filled.pdf'
fp = open(fname, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
for i in fields:
    field = resolve1(i)
    name, value = field.get('T'), field.get('V')
    print('{0}: {1}'.format(name, value))
print('Done!')
A sample file can be downloaded here.
How can I obtain the field names and contents?
As mkl explained, my PDF files store form data in XFA, a deprecated format. The XFA entry is an array of XML documents, and I have to look for the field names in each of them.
I used PyPDF2 library to do that:
import PyPDF2 as pypdf
import xml.etree.ElementTree as ET

fname = r'form.pdf'

def findInDict(needle, haystack):
    for key in haystack.keys():
        try:
            value = haystack[key]
        except:
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x

pdfobject = open(fname, 'rb')
pdf = pypdf.PdfFileReader(pdfobject)
xfaparts = findInDict('/XFA', pdf.resolvedObjects)
for xfa in xfaparts:
    if isinstance(xfa, pypdf.generic.IndirectObject):
        xml = str(xfa.getObject().getData())
        ## Then process XML to find form tags
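The "process XML to find form tags" step can be done with ElementTree once an XFA packet has been extracted. The fragment below is hypothetical (a made-up datasets packet with invented field names; real XFA files will differ), and simply treats leaf elements under the data section as field name/value pairs:

```python
import xml.etree.ElementTree as ET

# Hypothetical XFA "datasets" packet; the namespace URI and field names
# are illustrative, not taken from the actual form.
xfa_xml = """<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <OwnerName>Jane Doe</OwnerName>
      <ParcelArea>12.5</ParcelArea>
    </form1>
  </xfa:data>
</xfa:datasets>"""

root = ET.fromstring(xfa_xml)
# Leaf elements (no children) carry the field names as tags and values as text
fields = {elem.tag: elem.text for elem in root.iter() if len(elem) == 0}
print(fields)
```

For a real form you would run this over each XML document found in the /XFA array.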

How to parse XML from a local disk file in Python?

I have code like this:
import requests
user_agent_url = 'http://www.user-agents.org/allagents.xml'
xml_data = requests.get(user_agent_url).content
This fetches an online XML file into xml_data. How can I read it from a local disk file instead? I tried replacing the URL with a local path, but got an error:
raise InvalidSchema("No connection adapters were found for '%s'" % url)
InvalidSchema: No connection adapters were found
What has to be done?
Note that the code you quote does NOT parse the file - it simply puts the XML data into xml_data. The equivalent for a local file doesn't need to use requests at all: simply write
with open("/path/to/XML/file") as f:
    xml_data = f.read()
If you are determined to use requests then see this answer for how to write a file URL adapter.
You can read the file content with open and then parse it with the XML function from the xml.etree.ElementTree module.
It returns an Element object which you can loop through.
Example:
content = open("file.xml").read()
from xml.etree.ElementTree import XML
root = XML(content)
print(root.tag, root.text, list(root))
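As a side note, ElementTree can also parse straight from a filename or an open file object, skipping the manual read(). A small self-contained sketch, with an in-memory StringIO standing in for a file on disk:

```python
import io
import xml.etree.ElementTree as ET

# ET.parse accepts a filename or a file-like object;
# StringIO stands in for file.xml here
fake_file = io.StringIO('<agents><agent>Mozilla</agent></agents>')
tree = ET.parse(fake_file)
root = tree.getroot()
print(root[0].text)  # Mozilla
```

In real code you would pass the path directly: ET.parse("file.xml").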

Can't read from XML file in S3 with Python

I have an XML file sitting in S3 and I need to open it from a lambda function and write strings to a DynamoDB table. I am using etree to parse the file. However, I don't think any content is actually getting read from the file. Below is my code, the error, and some sample xml.
Code:
import boto3
import lxml
from lxml import etree

def lambda_handler(event, context):
    output = 'Lambda ran successfully!'
    return output

def WriteItemToTable():
    s3 = boto3.resource('s3')
    obj = s3.Object('bucket', 'object')
    body = obj.get()['Body'].read()
    image_id = etree.fromstring(body.content).find('.//IMAGE_ID').text
    print(image_id)

WriteItemToTable()
Error:
'str' object has no attribute 'content'
XML:
<HOST_LIST>
<HOST>
<IP network_id="X">IP</IP>
<TRACKING_METHOD>EC2</TRACKING_METHOD>
<DNS><![CDATA[i-xxxxxxxxxx]]></DNS>
<EC2_INSTANCE_ID><![CDATA[i-xxxxxxxxx]]></EC2_INSTANCE_ID>
<EC2_INFO>
<PUBLIC_DNS_NAME><![CDATA[xxxxxxxxxxxx]]></PUBLIC_DNS_NAME>
<IMAGE_ID><![CDATA[ami-xxxxxxx]]></IMAGE_ID>
I am trying to pull the AMI ID inside of the <IMAGE_ID> tag.
Content is read; what you get is just an attribute error. body is already the raw payload, and it has no content attribute. Instead of fromstring(body.content), just do fromstring(body).
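Applied to a closed-up version of the sample above (trimmed to the relevant tags, since the posted snippet is truncated), the fix looks like this. The stdlib ElementTree is used here; lxml shares the same fromstring/find API for this call:

```python
import xml.etree.ElementTree as ET

# Well-formed excerpt of the sample document; the real file continues further
body = b"""<HOST_LIST>
  <HOST>
    <EC2_INFO>
      <IMAGE_ID><![CDATA[ami-xxxxxxx]]></IMAGE_ID>
    </EC2_INFO>
  </HOST>
</HOST_LIST>"""

# fromstring(body), not fromstring(body.content)
image_id = ET.fromstring(body).find('.//IMAGE_ID').text
print(image_id)  # ami-xxxxxxx
```

The CDATA wrapper is transparent to the parser, so .text returns the AMI ID directly.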

Can anyone tell me what the error msg "line 1182 in parse" means when I'm trying to parse an XML in Python?

This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error (a traceback ending at "line 1182, in parse" with an IOError):
I'm new to Python. I did read the documentation and a couple of tutorials, but clearly I've still done something wrong. I don't believe it's the XML file itself, because it does this with two different XML files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
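The distinction between the two functions can be seen without any network access: parse() wants a filename or file object, while fromstring() wants the data itself. A sketch with an in-memory stand-in for the downloaded feed (the RSS snippet here is invented for illustration):

```python
import io
import xml.etree.ElementTree as ET

data = '<rss><channel><title>Example feed</title></channel></rss>'

# fromstring() takes the XML text itself
root_a = ET.fromstring(data)

# parse() takes a filename or file object; StringIO stands in for a file here
root_b = ET.parse(io.StringIO(data)).getroot()

print(root_a.find('.//title').text)  # Example feed
print(root_b.find('.//title').text)  # Example feed
```

Passing the string data to parse() makes it treat the XML itself as a filename, which is exactly the IOError in the question.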