Can't read from XML file in S3 with Python

I have an XML file sitting in S3 and I need to open it from a lambda function and write strings to a DynamoDB table. I am using etree to parse the file. However, I don't think any content is actually getting read from the file. Below is my code, the error, and some sample xml.
Code:
import boto3
import lxml
from lxml import etree

def lambda_handler(event, context):
    output = 'Lambda ran successfully!'
    return output

def WriteItemToTable():
    s3 = boto3.resource('s3')
    obj = s3.Object('bucket', 'object')
    body = obj.get()['Body'].read()
    image_id = etree.fromstring(body.content).find('.//IMAGE_ID').text
    print(image_id)

WriteItemToTable()
Error:
'str' object has no attribute 'content'
XML:
<HOST_LIST>
  <HOST>
    <IP network_id="X">IP</IP>
    <TRACKING_METHOD>EC2</TRACKING_METHOD>
    <DNS><![CDATA[i-xxxxxxxxxx]]></DNS>
    <EC2_INSTANCE_ID><![CDATA[i-xxxxxxxxx]]></EC2_INSTANCE_ID>
    <EC2_INFO>
      <PUBLIC_DNS_NAME><![CDATA[xxxxxxxxxxxx]]></PUBLIC_DNS_NAME>
      <IMAGE_ID><![CDATA[ami-xxxxxxx]]></IMAGE_ID>
I am trying to pull the AMI ID inside of the <IMAGE_ID> tag.

The content is being read; what you get is just an AttributeError. body already holds the file contents returned by read(), and it has no content attribute. Instead of fromstring(body.content), just call fromstring(body).
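Putting that fix into the question's function, a minimal sketch (the bucket and object names are the placeholders from the question):

import boto3
from lxml import etree

def WriteItemToTable():
    s3 = boto3.resource('s3')
    obj = s3.Object('bucket', 'object')   # placeholder bucket/key from the question
    body = obj.get()['Body'].read()       # raw bytes of the XML document
    # Parse the bytes directly; body has no .content attribute
    image_id = etree.fromstring(body).find('.//IMAGE_ID').text
    print(image_id)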

Related

how to use lxml iterparse from Azure StorageStreamDownloader?

I'm currently using lxml.etree.iterparse to iterate over an XML file tag by tag. Locally this works fine, but I want to move the XML file to Azure Blob Storage and process it in an Azure Function. However, I'm a bit stuck on how to parse the XML file from the StorageStreamDownloader.
Code locally
from lxml import etree

context = etree.iterparse('c:\\Users\\', tag='InstanceElement')
for event, elem in context:
    # processing of the tag
Streaming from Blob
from lxml import etree
from azure.storage.filedatalake import DataLakeServiceClient

connect_str = ''
service = DataLakeServiceClient.from_connection_string(conn_str=connect_str)
System = service.get_file_system_client('')
FileClient = System.get_file_client('')
Stream = FileClient.download_file()

# Stuck on what the input must be for iterparse
context = etree.iterparse(, tag='InstanceElement')
for event, elem in context:
    # processing of the tag
I'm stuck on what the input to iterparse should be, so any ideas on how to parse the XML file while streaming it?
Try this:
from lxml import etree
from azure.storage.filedatalake import DataLakeServiceClient
from io import BytesIO

connect_str = ''
service = DataLakeServiceClient.from_connection_string(conn_str=connect_str)
System = service.get_file_system_client('')
FileClient = System.get_file_client('test.xml')
content = FileClient.download_file().readall()
context = etree.iterparse(BytesIO(content), tag='InstanceElement')
for event, elem in context:
    print(elem.text)
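One caveat worth noting: readall() pulls the whole blob into memory before iterparse ever sees it. If memory is the concern, the usual iterparse pattern is to clear each element once it has been processed so the in-memory tree stays small. A minimal sketch, reusing FileClient from the answer above:

from io import BytesIO
from lxml import etree

content = FileClient.download_file().readall()
for event, elem in etree.iterparse(BytesIO(content), tag='InstanceElement'):
    print(elem.text)
    elem.clear()  # free the processed element so the tree does not keep growing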

Reading doc, docx files from s3 within lambda

TL;DR: reading doc and docx files stored on S3 from my AWS Lambda.
On my local machine I just use textract.process(file_path) to read both doc and docx files.
So the intuitive way to do the same on Lambda is to download the file from S3 to the Lambda's local storage (/tmp) and then process the temporary file like I do on my local machine.
That's not cost-effective...
Is there a way to make a pipeline from the S3 object straight into some parser like textract that'll just convert the doc/docx files into a readable object like string?
My code so far for reading files like txt.
import boto3

print('Loading function')

def lambda_handler(event, context):
    try:  # Read s3 file
        bucket_name = "appsresults"
        download_path = 'Folder1/file1.txt'
        filename = download_path
        s3 = boto3.resource('s3')
        content_object = s3.Object(bucket_name, filename)
        file_content = content_object.get()['Body'].read().decode('utf-8')
        print(file_content)
    except Exception as e:
        print("Couldnt read the file from s3 because:\n {0}".format(e))
    return event  # return event
This answer solves half of the problem
textract.process currently doesn't support reading file-like objects. If it did, you could have loaded the file from S3 directly into memory and passed it to the process function.
Older versions of textract internally used the python-docx package for reading .docx files. python-docx supports reading file-like objects. You can use the code below to achieve your goal, at least for .docx files.
import boto3
import io
from docx import Document

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
obj = bucket.Object('/files/resume.docx')

# Download the object into an in-memory buffer instead of /tmp
file_stream = io.BytesIO()
obj.download_fileobj(file_stream)
file_stream.seek(0)  # rewind before handing the buffer to python-docx
document = Document(file_stream)
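From there, getting the content back as a single string (the question's end goal) is just a matter of joining the paragraph texts, roughly:

text = '\n'.join(para.text for para in document.paragraphs)
print(text)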
If you're reading the docx file from S3, the Document() constructor expects a path to the file. Instead, you can read the file as bytes and call the constructor like this.
import io

import boto3
from docx import Document

def parseDocx(data):
    # Wrap the raw bytes in a file-like object for python-docx
    data = io.BytesIO(data)
    document = Document(docx=data)
    content = ''
    for para in document.paragraphs:
        content += para.text
    return content

s3_client = boto3.client('s3')
Key = "acb.docx"
Bucket = "xyz"
obj_ = s3_client.get_object(Bucket=Bucket, Key=Key)

if Key.endswith('.docx'):
    fs = obj_['Body'].read()
    sentence = str(parseDocx(fs))

Reading XML file's content from AWS S3 bucket using boto3 library

I am trying to read the content of an XML file for parsing using the boto3 library and am getting the below error while doing so.
I am using the below Python code.
import xml.etree.ElementTree as et
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
key = 'audit'

for obj in bucket.objects.filter(Prefix="Folder/XML.xml"):
    key = obj.key
    body = obj.get()['Body'].read()
    parsed_xml = et.fromstring(body)
I am getting the below error while printing the parsed_xml variable or body.
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 parsed_xml
NameError: name 'parsed_xml' is not defined
If I print body in the above code, it does show the XML content with its tags.
You have to define parsed_xml outside the for loop.
parsed_xml = ''
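Applied to the question's code, a minimal sketch looks like this (the name now exists even if the Prefix filter matches nothing, so the NameError disappears):

import xml.etree.ElementTree as et
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')

parsed_xml = ''  # defined before the loop so the name always exists
for obj in bucket.objects.filter(Prefix="Folder/XML.xml"):
    body = obj.get()['Body'].read()
    parsed_xml = et.fromstring(body)

print(parsed_xml)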

How to parse xml from local disk file in python?

I have code like this:
import requests
user_agent_url = 'http://www.user-agents.org/allagents.xml'
xml_data = requests.get(user_agent_url).content
This will parse an online XML file into xml_data. How can I parse it from a local disk file? I tried replacing the URL with a local disk path, but got an error:
raise InvalidSchema("No connection adapters were found for '%s'" % url)
InvalidSchema: No connection adapters were found
What has to be done?
Note that the code you quote does NOT parse the file - it simply puts the XML data into xml_data. The equivalent for a local file doesn't need to use requests at all: simply write
with open("/path/to/XML/file") as f:
    xml_data = f.read()
If you are determined to use requests then see this answer for how to write a file URL adapter.
You can read the file content using open and then parse it with the XML function from the xml.etree.ElementTree module.
It returns an Element object which you can loop through.
Example:
from xml.etree.ElementTree import XML

content = open("file.xml").read()
root = XML(content)
print(root.tag, root.text, list(root))

Can anyone tell me what the error msg "line 1182 in parse" means when I'm trying to parse an XML in python

This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to python. I did read documentation and a couple of tutorials, but clearly I still have done something wrong. I don't believe it is the xml file itself because it does this to two different xml files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
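For completeness, a small sketch of that variant, keeping the question's Python 2 style (raw_input and urllib.urlopen):

import urllib
import xml.etree.ElementTree as ET

url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)  # file-like object
tree = ET.parse(urlhandle)       # parse() accepts a filename or a file object
print tree.getroot().tag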
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
