zipfile.BadZipFile: File is not a zip file when using "openpyxl" engine - python

I have created a script which dumps the Excel sheets stored in S3 into my local Postgres database. I've used pandas' read_excel and ExcelFile methods to read the Excel sheets.
The code is shown below.
import boto3
import pandas as pd
import io
import os
from sqlalchemy import create_engine
import xlrd
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxxxxxxxx"
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket-name', Key='file.xlsx')
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
print(data.sheet_names)
a = len(data.sheet_names)
engine1 = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
for i in range(a):
    df = pd.read_excel(io.BytesIO(obj['Body'].read()), sheet_name=data.sheet_names[i], engine='openpyxl')
    df.to_sql("test" + str(i), engine1, index=False)
Basically, the code parses the S3 bucket and runs in a loop. For each sheet, it creates a table
and dumps the data from that sheet into the table.
Where I'm having trouble is that when I run this code, I get this error:
df = pd.read_excel(io.BytesIO(obj['Body'].read()),sheet_name=data.sheet_names[i-1], engine='openpyxl')
zipfile.BadZipFile: File is not a zip file
This appeared after I added the 'openpyxl' engine to the read_excel method. When I remove the engine, I get this error instead:
raise ValueError(
ValueError: Excel file format cannot be determined, you must specify an engine manually.
Please note that I can print the connection to the database, so there is no problem with connectivity, and I'm using the latest versions of Python and pandas. Also, I can get all the sheet_names in the Excel file, so I'm able to reach that file as well.
Many Thanks!

You are reading the obj twice, fully:
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
pd.read_excel(io.BytesIO(obj['Body'].read()), ...)
Your object's body can only be .read() once; the second read produces nothing, just an empty b"".
To avoid re-reading the S3 stream many times, you can store it once in a BytesIO and rewind that BytesIO with seek:
buf = io.BytesIO(obj["Body"].read())
pd.ExcelFile(buf)
buf.seek(0)
pd.read_excel(buf, ...)
# repeat
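For example, here is a minimal sketch of the corrected loop, using the bucket, key, and connection-string placeholders from the question:

import io
import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket-name', Key='file.xlsx')

# Read the S3 body exactly once and keep the bytes in memory
buf = io.BytesIO(obj['Body'].read())

data = pd.ExcelFile(buf)
engine1 = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

for i, sheet in enumerate(data.sheet_names):
    buf.seek(0)  # rewind the same buffer before every read
    df = pd.read_excel(buf, sheet_name=sheet, engine='openpyxl')
    df.to_sql("test" + str(i), engine1, index=False)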

Related

Bigquery to S3 File upload using Python: raise ValueError('Filename must be a string')

I'm trying to extract data from BigQuery, save it to CSV, and then upload it to S3, but I'm getting an error during the upload to S3. This is the error I get when I run the script:
raise ValueError('Filename must be a string')
Can you please help me solve this issue? I'm new to Python and AWS. Thank you.
Script is:
rows_df = query_job.result().to_dataframe()
file_csv = rows_df.to_csv(s3_filename, sep='|', index=False, encoding='utf-8')
s3.upload_file(file_csv, s3_bucket, file_csv)
Try changing the arguments passed to s3.upload_file like so:
s3.upload_file(s3_filename, s3_bucket, s3_filename)
The to_csv call writes the dataframe to a local file at the path s3_filename and returns None, so file_csv is None. Alternatively, if your dataframe is small enough to be held in memory, the following should do the trick:
import io
data = rows_df.to_csv(sep='|', index=False, encoding='utf-8')
data_buffer = io.BytesIO(data.encode('utf-8'))
s3.upload_fileobj(data_buffer, s3_bucket, s3_filename)
Based on the pandas docs, to_csv returns None when path_or_buf is specified. However, upload_file needs a filename and an S3 key as its first and third arguments respectively. Therefore, something like this should make it work:
s3.upload_file(s3_filename, s3_bucket, s3_filename)
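For completeness, a minimal sketch of that local-file variant, under the assumption that s3_filename is a plain string used both as the local path and as the S3 key (the query and bucket names here are placeholders):

from google.cloud import bigquery
import boto3

bq = bigquery.Client()
query_job = bq.query("SELECT 1 AS example")   # placeholder query
rows_df = query_job.result().to_dataframe()

s3_bucket = 'my-bucket'        # placeholder bucket name
s3_filename = 'export.csv'     # local path, reused here as the S3 key

# to_csv writes to disk and returns None, so upload the path itself
rows_df.to_csv(s3_filename, sep='|', index=False, encoding='utf-8')
boto3.client('s3').upload_file(s3_filename, s3_bucket, s3_filename)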

FileNotFoundError when reading .h5 file from S3 in python using Pandas

For some reason, when I attempt to read an HDF file from S3 using the pandas.read_hdf() method, I get a FileNotFoundError when I pass an S3 URL. The file definitely exists, and I have tried the pandas.read_csv() method with a CSV file in the same S3 directory and that works. Is there something else I need to be doing? Here's the code:
import boto3
import h5py
import s3fs
import pandas as pd
csvDataframe = pd.read_csv('s3://BUCKET_NAME/FILE_NAME.csv')
print("Csv data:")
print(csvDataframe)
dataframe = pd.read_hdf('s3://BUCKET_NAME/FILE_NAME.h5', key='df')
print("Hdf data:")
print(dataframe)
Here is the error:
FileNotFoundError: File s3://BUCKET_NAME/FILE_NAME.h5 does not exist
In the actual code, BUCKET_NAME and FILE_NAME are replaced with their actual strings.
Please make sure the file extension is .h5.
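If the extension is correct and the error persists, one possible workaround (my assumption, not part of the original answer) is to download the object to a temporary local file with boto3 and read it from there, since HDF5 files generally need to be opened from a seekable local path:

import tempfile
import boto3
import pandas as pd

s3 = boto3.client('s3')
with tempfile.NamedTemporaryFile(suffix='.h5') as tmp:
    # Download the whole object into a local temporary file
    s3.download_fileobj('BUCKET_NAME', 'FILE_NAME.h5', tmp)
    tmp.flush()
    dataframe = pd.read_hdf(tmp.name, key='df')
print(dataframe)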

Any way to send an xlsxwriter generated file to azure data lake without writing to local disk?

For purposes of security, I need to move a file to Azure Data Lake storage without writing the file locally. This is an Excel workbook that is being created with the xlsxwriter package. Here is what I have tried, which returns ValueError: Seek only available in read mode:
import pandas as pd
from azure.datalake.store import core, lib, multithread
import xlsxwriter as xl
# Dataframes have undergone manipulation not listed in this code and come from a DB connection
matrix = pd.DataFrame(Database_Query1)
raw = pd.DataFrame(Database_Query2)
# Name datalake path for workbook
dlpath = '/datalake/file/path/file_name.xlsx'
# List store name
store_name = 'store_name_here'
# Create auth token
token = lib.auth(tenant_id='tenant_id_here',
                 client_id='client_id_here',
                 client_secret='client_secret_here')
# Create management file system client object
adl = core.AzureDLFileSystem(token, store_name= store_name)
# Create workbook structure
writer = pd.ExcelWriter(adl.open(dlpath, 'wb'), engine= 'xlsxwriter')
matrix.to_excel(writer, sheet_name= 'Compliance')
raw.to_excel(writer, sheet_name= 'Raw Data')
writer.save()
Any ideas? Thanks in advance.
If the data is not monstrously huge, you might consider keeping the bytes in memory and dumping the stream back to your adl:
from io import BytesIO
xlb = BytesIO()
# ... do what you need to do ... #
writer = pd.ExcelWriter(xlb, engine= 'xlsxwriter')
matrix.to_excel(writer, sheet_name= 'Compliance')
raw.to_excel(writer, sheet_name= 'Raw Data')
writer.save()
# Set the cursor of the stream back to the beginning
xlb.seek(0)
with adl.open(dlpath, 'wb') as fl:
    # This part I'm not entirely sure - consult what your adl write methods are
    fl.write(xlb.read())
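A small variant of that last step, assuming the adl file object accepts a plain bytes write (an assumption; check your azure.datalake.store version), is to use getvalue() instead of seek plus read:

with adl.open(dlpath, 'wb') as fl:
    # getvalue() returns the whole buffer regardless of the current cursor position
    fl.write(xlb.getvalue())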

How to read only 5 records from s3 bucket and return it without getting all data of csv file

Hello guys, I know I'll find lots of similar questions here, but I have code that executes properly and returns five records. My question is how I can avoid reading the entire file and still return only the desired rows. Suppose the CSV file is gigabytes in size; I don't want to pull down the whole file just to get 5 records, so please tell me how to do that. Also, if possible, please explain whether my code is good, and if not, why not.
code:
import boto3
from botocore.client import Config
import pandas as pd
ACCESS_KEY_ID = 'something'
ACCESS_SECRET_KEY = 'something'
BUCKET_NAME = 'something'
Filename='dataRepository/source/MergedSeedData(Parts_skills_Durations).csv'
client = boto3.client("s3",
aws_access_key_id=ACCESS_KEY_ID,
aws_secret_access_key=ACCESS_SECRET_KEY)
obj = client.get_object(Bucket=BUCKET_NAME, Key=Filename)
Data = pd.read_csv(obj['Body'])
# data1 = Data.columns
# return data1
Data=Data.head(5)
print(Data)
This is my code, which runs fine and gets the 5 records from the S3 bucket, but I have explained above what I'm looking for. For any other questions feel free to message me. Thanks in advance.
You can use pandas' ability to read a file in chunks, loading only as much data as you need:
data_iter = pd.read_csv(obj['Body'], chunksize = 5)
data = data_iter.get_chunk()
print(data)
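A closely related option (my addition, not part of the original answer) is read_csv's nrows parameter, which stops parsing after the requested number of rows:

# assumes a fresh obj['Body'] stream, as in the answer above
data = pd.read_csv(obj['Body'], nrows=5)  # parse only the first 5 data rows
print(data)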
You can use an HTTP Range: header (see RFC 2616), which takes a byte-range argument. The S3 API supports this, and it will help you avoid reading/downloading the whole S3 file.
Sample code:
import boto3
obj = boto3.resource('s3').Object('bucket101', 'my.csv')
record_stream = obj.get(Range='bytes=0-1000')['Body']
print(record_stream.read())
This will return only the bytes of the range given in the header.
But you will need to modify this to convert those bytes into a DataFrame, for example by splitting on the newlines and delimiters present in the data coming from the .csv file, as sketched below.
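One possible sketch of that conversion (my addition; it assumes a comma-separated file and drops the last, possibly truncated, line of the byte range):

import io
import boto3
import pandas as pd

obj = boto3.resource('s3').Object('bucket101', 'my.csv')
partial = obj.get(Range='bytes=0-1000')['Body'].read()

# The range may end mid-row, so keep only up to the last complete newline
partial = partial[:partial.rfind(b'\n') + 1]

df = pd.read_csv(io.BytesIO(partial))
print(df.head(5))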

Azure storage get_blob_to_stream cant download saved csv file as stream

I want to download a CSV file stored in Azure storage into a stream and use it directly in my Python script, but after I did this with help from Thomas, I cannot use the pandas read_csv method; the error message is pandas.io.common.EmptyDataError: No columns to parse from file. Thus I assume the downloaded CSV stream is actually empty, but after checking the storage account, the CSV file is fine with all the data inside it. What is the problem here? Below is the code from Thomas:
from azure.storage.blob import BlockBlobService
import io
from io import BytesIO, StringIO
import pandas as pd
from shutil import copyfileobj
with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my account', account_key='mykey')
        block_blob_service.get_blob_to_stream('my counter', 'datatest1.csv', input_blob)
        df = pd.read_csv(input_blob)
        print(df)
        copyfileobj(input_blob, output_blob)
        #print(output_blob)
        # Create a new blob
        block_blob_service.create_blob_from_stream('my counter', 'datatest2.csv', output_blob)
If I don't execute the read_csv code, create_blob_from_stream creates an empty file, but if I execute the read_csv code, I get this error:
pandas.parser.TextReader.__cinit__ (pandas\parser.c:6171)
pandas.io.common.EmptyDataError: No columns to parse from file
The downloaded file is stored fine in blob storage with all the data in it.
I finally figured it out, after spending so much time on this!
You have to execute:
input_blob.seek(0)
to use the stream after saving the stream to input_blob.
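Put differently, a minimal sketch of the corrected flow (using the account, container, and blob placeholders from the question):

from io import BytesIO
import pandas as pd
from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(account_name='my account', account_key='mykey')

with BytesIO() as input_blob:
    block_blob_service.get_blob_to_stream('my counter', 'datatest1.csv', input_blob)

    input_blob.seek(0)   # rewind before handing the stream to pandas
    df = pd.read_csv(input_blob)
    print(df)

    input_blob.seek(0)   # rewind again before re-uploading the same bytes
    block_blob_service.create_blob_from_stream('my counter', 'datatest2.csv', input_blob)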
