Pandas 0.25.0 and xlsx from response content stream - python

import pandas as pd
import requests

r = requests.get(projectsExportURL, auth=(username, password), verify=False, stream=True)
r.raw.decode_content = True
# add snapshot date column
df = pd.read_excel(r.raw, sheet_name='Tasks', header=0)
This worked just fine until pandas 0.25.0 and xlrd 1.2.0.
I recently had to rebuild my entire environment and opted to update. The above code now results in the following error:
File "d:\python\python37\lib\site-packages\pandas\io\excel\_base.py", line 356, in __init__
filepath_or_buffer.seek(0)
UnsupportedOperation: seek
If I remove xlrd from the equation, pandas throws an error about a missing optional library (if it is optional, why is it complaining?).
The incoming data is in xlsx format; I have to add a snapshot date column to it and then send it to a MySQL database.
How can I fix my code to read the Excel file given the changes to pandas? Nothing in the docs is specifically jumping out at me about this.

Here is my current replacement code that seems to be working:
from io import BytesIO
from openpyxl import load_workbook

wb = load_workbook(filename=BytesIO(r.raw.read()))
ws = wb['Tasks']
data = ws.values
columns = next(data)  # first row holds the column headers
df = pd.DataFrame(data, columns=columns)
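
If you would rather keep using pd.read_excel, a minimal alternative sketch (assuming the whole workbook fits in memory) is to drop stream=True and wrap the downloaded bytes in a seekable BytesIO, since the seek error comes from pandas trying to rewind the non-seekable raw stream:
from io import BytesIO

import pandas as pd
import requests

r = requests.get(projectsExportURL, auth=(username, password), verify=False)
# r.content holds the full body as bytes; BytesIO gives pandas something it can seek
df = pd.read_excel(BytesIO(r.content), sheet_name='Tasks', header=0)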

This is how I solved the problem of downloading an xlsx file into a DataFrame; it works on pandas >= 1.0:
from io import BytesIO

import pandas as pd
import requests

xl = requests.get(EXCEL_URL)
df = pd.read_excel(BytesIO(xl.content), sheet_name="Worksheet Name")
If sheet_name is not given, the first sheet is loaded.
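As a side note, if you need every sheet rather than one, passing sheet_name=None makes read_excel return a dict of DataFrames keyed by sheet name (standard pandas behaviour; the URL and variable names follow the snippet above):
all_sheets = pd.read_excel(BytesIO(xl.content), sheet_name=None)
for name, frame in all_sheets.items():
    print(name, frame.shape)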

Related

zipfile.BadZipFile: File is not a zip file when using "openpyxl" engine

I have created a script which dumps Excel sheets stored in S3 into my local Postgres database, using pandas' read_excel and ExcelFile to read the sheets.
Here is the code:
import boto3
import pandas as pd
import io
import os
from sqlalchemy import create_engine
import xlrd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxxxxxxxx"

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket-name', Key='file.xlsx')
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
print(data.sheet_names)
a = len(data.sheet_names)

engine1 = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

for i in range(a):
    df = pd.read_excel(io.BytesIO(obj['Body'].read()), sheet_name=data.sheet_names[i], engine='openpyxl')
    df.to_sql("test" + str(i), engine1, index=False)
Basically, the code reads the file from the S3 bucket and runs in a loop: for each sheet, it creates a table and dumps that sheet's data into it.
Where I'm having trouble is that when I run this code, I get this error:
df = pd.read_excel(io.BytesIO(obj['Body'].read()),sheet_name=data.sheet_names[i-1], engine='openpyxl')
zipfile.BadZipFile: File is not a zip file
This started after I added the 'openpyxl' engine to the read_excel method. When I remove the engine, I get this error instead:
raise ValueError(
ValueError: Excel file format cannot be determined, you must specify an engine manually.
Please note that I can print the connection to the database, so there is no connectivity problem, and I'm using the latest versions of Python and pandas. Also, I can get all the sheet_names in the Excel file, so I'm able to reach the file as well.
Many Thanks!
You are reading the obj twice, fully:
data = pd.ExcelFile(io.BytesIO(obj['Body'].read()))
pd.read_excel(io.BytesIO(obj['Body'].read()), ...)
Your object can only be .read() once; the second read produces nothing, just an empty b"".
To avoid re-reading the S3 stream many times, you could store it once in a BytesIO and rewind that BytesIO with seek:
buf = io.BytesIO(obj["Body"].read())
pd.ExcelFile(buf)
buf.seek(0)
pd.read_excel(buf, ...)
# repeat
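Putting that together, a sketch of the corrected loop (keeping the variable names from the question, including the engine1 connection) could look like this:
buf = io.BytesIO(obj['Body'].read())   # read the S3 body exactly once
data = pd.ExcelFile(buf)

for i, sheet in enumerate(data.sheet_names):
    buf.seek(0)                        # rewind before every read
    df = pd.read_excel(buf, sheet_name=sheet, engine='openpyxl')
    df.to_sql("test" + str(i), engine1, index=False)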

How do I write multiple dataframes to excel using python and download the results using streamlit?

I have a Python script using Streamlit that allows the user to upload certain Excel files and then automatically runs my analysis on them. I then want the user to download the results in xlsx format using the Streamlit download button. I know how to let them download one dataframe as a CSV, but not as an xlsx file, which is what I want to do.
Here's what I've tried so far; this comes after my analysis, where I'm just trying to create the download button so the user can download the results stored in 3 different dataframes:
import pandas as pd
import streamlit as st

# arrived_clean, booked_grouped, and arrived_grouped are all dataframes that I want
# to export to an Excel file as results for the user to download.
def convert_df():
    writer = pd.ExcelWriter('test_data.xlsx', engine='xlsxwriter')
    arrived_clean.to_excel(writer, sheet_name='Cleaned', startrow=0, startcol=0, index=False)
    booked_grouped.to_excel(writer, sheet_name='Output', startrow=0, startcol=0, index=False)
    arrived_grouped.to_excel(writer, sheet_name='Output', startrow=0, startcol=20, index=False)
    writer.save()

csv = convert_df()

st.download_button(
    label="Download data",
    data=csv,
    file_name='test_data.xlsx',
    mime='text/xlsx',
)
When I first run the streamlit app locally I get this error:
"NameError: name 'booked_grouped' is not defined"
I get it because I haven't uploaded any files yet. After I upload my files the error message goes away and everything runs normally. However, I get this error and I don't see the download button to download my new dataframes:
RuntimeError: Invalid binary data format: <class 'NoneType'> line 313,
in marshall_file raise RuntimeError("Invalid binary data format: %s" %
type(data))
Can someone tell me what I'm doing wrong? It's the last piece I have to figure out.
The Pluviophile's answer is correct, but you should use output in pd.ExcelWriter instead of file_name:
from io import BytesIO

def dfs_tabs(df_list, sheet_list, file_name):
    output = BytesIO()
    writer = pd.ExcelWriter(output, engine='xlsxwriter')
    for dataframe, sheet in zip(df_list, sheet_list):
        dataframe.to_excel(writer, sheet_name=sheet, startrow=0, startcol=0)
    writer.save()
    processed_data = output.getvalue()
    return processed_data
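For completeness, a usage sketch under the same assumptions (the three dataframes from the question; the sheet names here are illustrative) that also sets an explicit spreadsheet MIME type on the download button:
df_xlsx = dfs_tabs([arrived_clean, booked_grouped, arrived_grouped],
                   ['Cleaned', 'Booked', 'Arrived'],
                   'test_data.xlsx')
st.download_button(
    label='Download data',
    data=df_xlsx,
    file_name='test_data.xlsx',
    mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
)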
When I first run the streamlit app locally I get this error:
"NameError: name 'booked_grouped' is not defined"
Assuming your code is
booked_grouped = st.file_uploader('Something.....')
you can use the check below to skip the error:
if booked_grouped:
    # All your code inside this indentation
To download the Excel file, convert all dataframes into one single Excel workbook:
# Function to save all dataframes to one single excel
def dfs_tabs(df_list, sheet_list, file_name):
    output = BytesIO()
    writer = pd.ExcelWriter(file_name, engine='xlsxwriter')
    for dataframe, sheet in zip(df_list, sheet_list):
        dataframe.to_excel(writer, sheet_name=sheet, startrow=0, startcol=0)
    writer.save()
    processed_data = output.getvalue()
    return processed_data
# list of dataframes
dfs = [df, df1, df2]
# list of sheet names
sheets = ['df','df1','df2']
Note that the data to be downloaded is stored in memory while the user is connected, so it's a good idea to keep file sizes under a couple of hundred megabytes to conserve memory.
df_xlsx = dfs_tabs(dfs, sheets, 'multi-test.xlsx')
st.download_button(label='📥 Download Current Result',
                   data=df_xlsx,
                   file_name='df_test.xlsx')

Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

I am writing some automated scripts to process Excel files in Python; some are in XLS format. Here's a code snippet of my attempt to do so with pandas:
df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])
contents is the file contents pulled from an AWS S3 bucket. When this line runs I get [ERROR] ValueError: File is not a recognized excel file.
In troubleshooting this, I have tried to access the spreadsheet using xlrd directly:
book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))
This works without errors so xlrd seems to recognize it as an Excel file, just not when asked to do so by Pandas.
Anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?
Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?
pd.read_excel can take a book...
import xlrd
book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
   some   column headers
0     1     some     foo
1     2  strings     bar
2     3     here     yes
3     4      too      no
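If you would rather skip the book-to-read_excel hand-off, a sketch of building the frame straight from the xlrd sheet (assuming contents holds the raw bytes pulled from S3, as in the question) looks like this:
import xlrd
import pandas as pd

book = xlrd.open_workbook(file_contents=contents)
sheet = book.sheet_by_index(0)
rows = [sheet.row_values(i) for i in range(sheet.nrows)]
# mirror the original call: drop the first 5 rows and supply the same column names
df = pd.DataFrame(rows[5:], columns=['some', 'column', 'headers'])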
I'll include the code below that may help if you want to check/handle Excel file types. Maybe you can adapt it for your needs.
The code loops through a local folder and shows each file and its extension, then uses python-magic to drill into it. It also has a column showing the guess from mimetypes, but that isn't as good. Looking at the resulting frame, some .xls files are not what their extension says, and a .txt is actually an Excel file.
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic

path = r'./file_check'  # use your path
all_files = glob.glob(path + "/*.*")

data = []
for file in all_files:
    name, extension = os.path.splitext(file)
    data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])

df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df
From there you could filter files based on their type:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        # DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        # DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue
dfs = []
for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
       (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)

print('\nHow many frames did we get from seven files? ', len(dfs))
Output:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4

Unable to read Excel file, list index out of range error, can't find sheets

I am trying to read an Excel (.xlsx) file and convert it to a dataframe. I used pandas.ExcelFile, pandas.read_excel, openpyxl's load_workbook, and even io file-reading methods, but I am unable to read the sheet of this file. Every time I get a list index out of range error, or no sheet names in the case of openpyxl. I also tried the xlrd method.
temp_df = pd.read_excel("v2s.xlsx", sheet_name = 0)
or
temp_df = pd.read_excel("v2s.xlsx", sheet_name = "Sheet1")
or
from openpyxl import load_workbook
workbook = load_workbook(filename="v2s.xlsx",read_only = True, data_only = True)
workbook.sheetnames
Link to excel file
According to this ticket, the file is saved in a "slightly defective" format.
The user in that ticket fixed it by using Save As to change the document type back to a normal Excel spreadsheet file.
Your file is saved in the defective type, so you need to re-save it as a standard Excel Workbook (.xlsx).
Then running your code
from openpyxl import load_workbook
workbook = load_workbook(filename="v2s_0.xlsx",read_only = True, data_only = True)
print(workbook.sheetnames)
Outputs:
['Sheet1']
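If you want a quick programmatic check before re-saving, one diagnostic sketch (not a fix) is to ask whether the file is at least a valid ZIP container, which every well-formed .xlsx must be:
import zipfile

# True only means the outer container is a valid ZIP; openpyxl can still
# reject a non-standard package inside it.
print(zipfile.is_zipfile("v2s.xlsx"))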

How do I download an xlsm file and read every sheet in python?

Right now I am doing the following.
import requests
import xlrd

resp = requests.get(url, auth=auth).content
output = open(r'temp.xlsx', 'wb')
output.write(resp)
output.close()

xl = xlrd.open_workbook(r'temp.xlsx')
sh = 1
xls = []  # collect sheet names here
try:
    for sheet in xl.sheets():
        xls.append(sheet.name)
except:
    xls = ['']
It extracts the sheet names, but I don't know how to read the file, or whether saving the file as .xlsx actually works for macros. All I know is that the code is not working right now, and I need to be able to capture the data that is generated by a macro. Please help! Thanks.
I highly recommend using xlwings if you want to open, modify, and save .xlsm files without corrupting them. I have tried a ton of different methods (using other modules like openpyxl) and the macros always end up being corrupted.
import xlwings as xw
app = xw.App(visible=False) # IF YOU WANT EXCEL TO RUN IN BACKGROUND
xlwb = xw.Book('PATH\\TO\\FILE.xlsm')
xlws = {}
xlws['ws1'] = xlwb.sheets['Your Worksheet']
print(xlws['ws1'].range('B1').value) # get value
xlws['ws1'].range('B1').value = 'New Value' # change value
yourMacro = xlwb.macro('YourExcelMacro')
yourMacro()
xlwb.save()
xlwb.close()
Edit - I added an option to keep Excel invisible at a user's request.
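To actually read every sheet into pandas, a sketch along the same lines (assuming each worksheet has a single header row) could iterate over xlwb.sheets and pull each used range into a DataFrame via xlwings' converter options:
import pandas as pd
import xlwings as xw

app = xw.App(visible=False)
xlwb = xw.Book('PATH\\TO\\FILE.xlsm')

frames = {}
for sheet in xlwb.sheets:
    # used_range covers every populated cell; options() converts it to a DataFrame
    frames[sheet.name] = sheet.used_range.options(pd.DataFrame, header=1, index=False).value

xlwb.close()
app.quit()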
