Reading XML file's content from AWS S3 bucket using boto3 library - python

I am trying to read the content of an XML file for parsing using the boto3 library, and I am getting the error below.
I am using the following Python code.
import xml.etree.ElementTree as et
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
key = 'audit'
for obj in bucket.objects.filter(Prefix="Folder/XML.xml"):
    key = obj.key
    body = obj.get()['Body'].read()
    parsed_xml = et.fromstring(body)
I get the error below when printing the parsed_xml variable or body.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
in ()
----> 1 parsed
NameError: name 'parsed_xml' is not defined
If I print body in the above code, I expect it to show the XML tags.

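You have to define 'parsed_xml' outside the 'for' loop. The NameError means the assignment inside the loop never ran, most likely because no object matched the prefix.
parsed_xml = ''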
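Defining the name before the loop removes the NameError, but it can hide the underlying cause, which is usually that the filter matched nothing. Here is a minimal sketch of that idea. The helper `parse_first_xml` and the stand-in class `FakeObject` are hypothetical (neither is part of boto3); they exist only so the example runs without AWS access.

```python
import io
import xml.etree.ElementTree as et


class FakeObject:
    """Hypothetical stand-in for a boto3 object summary, for offline use only."""

    def __init__(self, key, data):
        self.key = key
        self._data = data

    def get(self):
        # Mirrors the shape used in the question: obj.get()['Body'].read()
        return {"Body": io.BytesIO(self._data)}


def parse_first_xml(objects):
    """Parse the first object's body as XML; return None if nothing matched."""
    parsed_xml = None  # bound before the loop, so it exists even if the loop never runs
    for obj in objects:
        body = obj.get()["Body"].read()
        parsed_xml = et.fromstring(body)
        break  # only the first match is needed
    return parsed_xml


found = parse_first_xml([FakeObject("Folder/XML.xml", b"<root><item>1</item></root>")])
missing = parse_first_xml([])  # simulates a prefix that matches no objects
print(found.tag if found is not None else "no match")
print(missing)
```

With boto3 this would presumably be called as `parse_first_xml(bucket.objects.filter(Prefix="Folder/XML.xml"))`. A `None` result then suggests that the prefix or bucket name does not match any key.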

Related

Trying to read a config file in order to connect to twitter API

I am brand new at all of this and I am completely lost, even after Googling, watching hours of YouTube videos, and reading posts on this site for the past week.
I am using Jupyter Notebook.
I have a config file with my API keys; it is called config.ipynb.
I have a different file where I am trying to call (I am not sure if this is the correct terminology) my config file so that I can connect to the Twitter API, but I am getting an AttributeError.
Here is my code:
import numpy as np
import pandas as pd
import tweepy as tw
import configparser
#Read info from the config file named config.ipynb
config = configparser.ConfigParser()
config.read(config.ipynb)
api_key = config[twitter][API_key]
print(api_key) #to test if I did this correctly
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [17], line 4
1 #Read info from the config file named config.ipynb
3 config = configparser.ConfigParser()
----> 4 config.read(config.ipynb)
5 api_key = config[twitter][API_key]
AttributeError: 'ConfigParser' object has no attribute 'ipynb'
After fixing my read() mistake I received a MissingSectionHeaderError.
MissingSectionHeaderError: File contains no section headers.
file: 'config.ipynb', line: 1 '{\n'.
My header in my config file is [twitter], but that gives me a NameError that says twitter is not defined. I have updated this many times based on what I have read, but I always get the same error.
My config.ipynb file code is below:
['twitter']
API_key = "" #key between the ""
API_secret = "" #key between the ""
Bearer_token = "" #key between the ""
Client_ID = "" #key between the ""
Client_Secret = "" #key between the ""
I have tried [twitter], ['twitter'], and ["twitter"], but all of them raise a MissingSectionHeaderError.
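Per your last comment on Brance's answer, this is probably related to your file path. If the file path is not correct, config.read() silently skips the file and the later lookup raises a KeyError. (A NameError comes from unquoted names in the Python code, such as config[twitter].)
Tested and working in Jupyter:
Note that no quotation marks are used around the section name in the config file, i.e. [twitter] rather than ["twitter"].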
# stackoverflow.txt
[twitter]
API_key = 6556456fghhgf
API_secret = afsdfsdf45435
import configparser
import os
# Define file path and make sure path is correct
file_name = "stackoverflow.txt"
# Config file stored in the same directory as the script.
# Get current working directory with os.getcwd()
file_path = os.path.join(os.getcwd(), file_name)
# Confirm that the file exists.
assert os.path.isfile(file_path) is True
# Read info from the config file named stackoverflow.txt
config = configparser.ConfigParser()
config.read(file_path)
# Will raise KeyError if the file path is not correct
api_key = config["twitter"]["API_key"]
print(api_key)
You are using the read() method incorrectly. Its argument should be the filename as a string, so if your file is named config.ipynb, the call needs to be
config.read('config.ipynb')
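One more point ties the two answers together: a .ipynb notebook is stored as JSON, so its first line is '{', which matches the MissingSectionHeaderError shown in the question. Below is a small sketch that assumes the keys are moved into a plain-text file. The name config.ini and the key value are illustrative only. It also shows that read() returns the list of files it managed to read, which is a cheap way to catch a wrong path.

```python
import configparser
import os
import tempfile

# Plain-text INI content: unquoted section name, unquoted values.
ini_text = "[twitter]\nAPI_key = 6556456fghhgf\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.ini")  # hypothetical file name
    with open(path, "w") as f:
        f.write(ini_text)

    config = configparser.ConfigParser()
    # read() skips files it cannot open and returns the ones it did read.
    read_ok = config.read([path, os.path.join(tmp, "missing.ini")])
    api_key = config["twitter"]["API_key"]

# Feeding notebook-style JSON to configparser reproduces the error from the question.
try:
    configparser.ConfigParser().read_string('{\n "cells": []\n}')
    notebook_error = None
except configparser.MissingSectionHeaderError as exc:
    notebook_error = exc

print(read_ok, api_key, type(notebook_error).__name__)
```

Checking that `read_ok` is non-empty before indexing into `config` turns a confusing KeyError into an obvious "file not found" condition.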

Create PDF from HTML using AWS Lambda and consuming imgs from a S3 Bucket with Python

I have an issue that I hope someone can help me with.
I have a process that saves some images into an S3 bucket.
Then I have a Lambda function, written in Python, that is supposed to create a PDF file displaying these images.
I'm using the xhtml2pdf library for that, which I've uploaded to my Lambda environment as a layer.
My first approach was to download the image from the S3 bucket and save it into the Lambda '/tmp' directory, but I was getting this error from xhtml2pdf:
Traceback (most recent call last):
  File "/opt/python/xhtml2pdf/xhtml2pdf_reportlab.py", line 359, in __init__
    raise RuntimeError('Imaging Library not available, unable to import bitmaps only jpegs')
RuntimeError: Imaging Library not available, unable to import bitmaps only jpegs fileName=<_io.BytesIO object at 0x7f1eaabe49a0>
Then I thought that converting the image to a base64 data URI would solve this, but I got the same error.
Can anybody please give me some guidance about the best way to do this?
Thank you
This is a small piece of my Lambda code:
import boto3
import botocore.exceptions
from xhtml2pdf import pisa

# bot_utils, getImgBase64 and shoppingLines are defined elsewhere in the Lambda (not shown here)
def getFileFromS3(fileKey, fileName):
    try:
        localFileName = f'/tmp/{fileName}'
        bot_utils.log(f'fileKey : {fileKey}')
        bot_utils.log(f'fileName : {fileName}')
        bot_utils.log(f'localFileName : {localFileName}')
        s3 = boto3.client('s3')
        bucketName = 'fileholder'
        s3.download_file(bucketName, fileKey, localFileName)
        return 'data:image/jpeg;base64,' + getImgBase64( localFileName )
    except botocore.exceptions.ClientError as e:
        raise

htmlText = '<table>'
for i in range(0, len(shoppingLines), 2):
    product = shoppingLines[i]
    text = product['text']
    folderName = product['folder']
    tmpFile = getFileFromS3(f"pannings/{folderName}/{product['photo_id']}.jpg", f"{product['photo_id']}.jpg")
    htmlText += f"""<tr><td align="center"><img src="{tmpFile}" width="40" height="55"></td><td>{text}</td></tr>"""
htmlText += '</table>'
result_file = open('/tmp/file.pdf', "w+b")
pisa_status = pisa.CreatePDF(htmlText ,dest=result_file)
result_file.close()
For future Google searches:
It seems the issue is with the PIL/Pillow library, which is needed to load the images.
I found a build of this library in this Git repo (https://github.com/keithrozario/Klayers).
When I use this version, it works.
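Before swapping layers, it may help to confirm that Pillow is really the missing piece. The sketch below is only a diagnostic idea, not part of xhtml2pdf. The function name `pillow_status` is hypothetical, and it simply reports whether `PIL` can be imported in the current environment.

```python
def pillow_status():
    """Return (available, detail) describing whether Pillow can be imported here."""
    try:
        import PIL
        from PIL import Image  # noqa: F401  (fails if the compiled parts cannot load)
    except ImportError as exc:
        return False, f"Pillow import failed: {exc}"
    version = getattr(PIL, "__version__", "unknown version")
    return True, f"Pillow {version} is importable"


available, detail = pillow_status()
print(detail)
```

Logging `detail` at the top of the Lambda handler would show whether the layer in use provides a Pillow build that loads on the Lambda runtime.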

Using urllib.request to write an image

I am trying to use this code to download an image from the given URL
import urllib.request
resource = urllib.request.urlretrieve("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
output = open("file01.jpg","wb")
output.write(resource)
output.close()
However, I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-39-43fe4522fb3b> in <module>()
41 resource = urllib.request.urlretrieve("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
42 output = open("file01.jpg","wb")
---> 43 output.write(resource)
44 output.close()
TypeError: a bytes-like object is required, not 'tuple'
I get that it's the wrong data type for the .write() method, but I don't know how to feed resource into output.
Right. You can use urllib.request.urlretrieve like this:
import urllib.request
resource, headers = urllib.request.urlretrieve("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
# resource is the path of the temporary file that holds the download
with open(resource, "rb") as src:
    image_data = src.read()
with open("file01.jpg", "wb") as f:
    f.write(image_data)
PS: urllib.request.urlretrieve returns a tuple whose first element is the path of a temporary file. You can read the bytes from that temporary file and save them to a new file.
From the official documentation:
The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.
So I would recommend using urllib.request.urlopen instead. Try the code below:
import urllib.request
resource = urllib.request.urlopen("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg")
output = open("file01.jpg", "wb")
output.write(resource.read())
output.close()
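The urlopen version above can also be written with context managers so that both the response and the output file are closed even if an error occurs. This is a sketch only. The helper name `save_url` is hypothetical, and the demo uses a local file:// URL in place of the image URL so that it runs without network access.

```python
import pathlib
import shutil
import tempfile
import urllib.request


def save_url(url, dest_path):
    """Stream the response body at `url` into `dest_path`; return the file size in bytes."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, no need to hold it all in memory
    return pathlib.Path(dest_path).stat().st_size


# Offline demo: a local file:// URL stands in for the image URL from the question.
with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / "source.jpg"
    source.write_bytes(b"\xff\xd8fake-jpeg-bytes")
    target = pathlib.Path(tmp) / "file01.jpg"
    written = save_url(source.as_uri(), target)
    copied = target.read_bytes()

print(written)
```

For the original case the call would be `save_url("http://farm2.static.flickr.com/1184/1013364004_bcf87ed140.jpg", "file01.jpg")`. If staying with urlretrieve, passing the destination as its second argument, `urlretrieve(url, "file01.jpg")`, saves the download directly to that path.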

Can't read from XML file in S3 with Python

I have an XML file sitting in S3 and I need to open it from a lambda function and write strings to a DynamoDB table. I am using etree to parse the file. However, I don't think any content is actually getting read from the file. Below is my code, the error, and some sample xml.
Code:
import boto3
import lxml
from lxml import etree
def lambda_handler(event, context):
    output = 'Lambda ran successfully!'
    return output
def WriteItemToTable():
    s3 = boto3.resource('s3')
    obj = s3.Object('bucket', 'object')
    body = obj.get()['Body'].read()
    image_id = etree.fromstring(body.content).find('.//IMAGE_ID').text
    print(image_id)
WriteItemToTable()
Error:
'str' object has no attribute 'content'
XML:
<HOST_LIST>
  <HOST>
    <IP network_id="X">IP</IP>
    <TRACKING_METHOD>EC2</TRACKING_METHOD>
    <DNS><![CDATA[i-xxxxxxxxxx]]></DNS>
    <EC2_INSTANCE_ID><![CDATA[i-xxxxxxxxx]]></EC2_INSTANCE_ID>
    <EC2_INFO>
      <PUBLIC_DNS_NAME><![CDATA[xxxxxxxxxxxx]]></PUBLIC_DNS_NAME>
      <IMAGE_ID><![CDATA[ami-xxxxxxx]]></IMAGE_ID>
I am trying to pull the AMI ID inside of the <IMAGE_ID> tag.
The content is being read; what you get is just an attribute error. body is already a string (the result of read()) and it has no content attribute. Instead of fromstring(body.content), just call fromstring(body).
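To illustrate the fix without S3, the sketch below parses an inline copy of the sample XML. Two assumptions are made: the closing tags are added by hand because the sample in the question is truncated, and the standard-library ElementTree is swapped in for lxml so the example has no third-party dependency (lxml's `etree.fromstring` is called the same way).

```python
import xml.etree.ElementTree as ET

# Stand-in for obj.get()['Body'].read(); closing tags are assumed.
body = b"""<HOST_LIST>
  <HOST>
    <IP network_id="X">IP</IP>
    <TRACKING_METHOD>EC2</TRACKING_METHOD>
    <EC2_INFO>
      <PUBLIC_DNS_NAME><![CDATA[xxxxxxxxxxxx]]></PUBLIC_DNS_NAME>
      <IMAGE_ID><![CDATA[ami-xxxxxxx]]></IMAGE_ID>
    </EC2_INFO>
  </HOST>
</HOST_LIST>"""

root = ET.fromstring(body)  # pass the data itself, not body.content
image_id = root.find(".//IMAGE_ID").text  # CDATA content comes back as plain text
print(image_id)
```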

Can anyone tell me what the error message "line 1182 in parse" means when I'm trying to parse an XML file in Python?

This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to Python. I did read the documentation and a couple of tutorials, but clearly I have still done something wrong. I don't believe it is the XML file itself, because the same thing happens with two different XML files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data holds the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It fails to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
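The difference between the two entry points can be seen side by side in a short Python 3 sketch. The XML snippet is made up for illustration, and `io.BytesIO` stands in for the file-like object that urlopen() would return.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"<rss><channel><title>Example feed</title></channel></rss>"

# fromstring() takes the XML text itself and returns the root Element.
root_from_string = ET.fromstring(xml_bytes)

# parse() takes a filename or a file-like object and returns an ElementTree.
file_like = io.BytesIO(xml_bytes)  # stands in for the object returned by urlopen()
root_from_file = ET.parse(file_like).getroot()

title = root_from_file.findtext("channel/title")

# Passing the XML text to parse() makes it try to open a file with that name.
try:
    ET.parse(xml_bytes.decode())
    parse_error = None
except OSError as exc:
    parse_error = exc

print(root_from_string.tag, title, type(parse_error).__name__)
```

In Python 3 the failure surfaces as an OSError subclass (IOError is an alias of OSError there), which corresponds to the IOError described above.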
