I have tons of .xlsm files that I have to load. Each Excel file has 6 sheets. Because of that, I'm opening each Excel file like this, using pandas:
for excel_file in files_list:
    with pd.ExcelFile(excel_file, engine="openpyxl") as f:
        df1 = pd.read_excel(f, "Sheet1")
        df2 = pd.read_excel(f, "Sheet2")
        df3 = pd.read_excel(f, "Sheet3")
        ...
After each iteration I pass the DataFrames to another function that does some work with them. I am using pd.ExcelFile to load the file into memory just once and then split it into separate DataFrames.
However, when doing this, I am getting the following warning:
/opt/anaconda3/lib/python3.8/site-packages/openpyxl/worksheet/_reader.py:300: UserWarning: Data Validation extension is not supported and will be removed
warn(msg)
Despite the warning, the information is loaded correctly from the Excel file and no data is missing. It takes about 0.8s to load each Excel file and all of its sheets into DataFrames. If I use pandas' default engine to load each Excel file, the warning goes away, but the time per file goes up to 5 or even 6 seconds.
I saw this post, but there wasn't an answer on how to remove the warning, which is what I need, as everything's working correctly.
How can I disable said UserWarning?
You can do this using the warnings core module:
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='openpyxl')
The module='openpyxl' argument restricts the filter to warnings raised from openpyxl, so UserWarnings from other modules are still shown.
See What causes "UserWarning: Discarded range with reserved name" - openpyxl (a different warning, same solution): put the warning filters back to default after you open the book, since there may be other warnings that you do want to see.
import warnings
from openpyxl import load_workbook

warnings.simplefilter("ignore")
wb = load_workbook(path)
warnings.simplefilter("default")
If you want to ignore this warning specifically, and only in a given context, you can combine catch_warnings and filterwarnings with the message argument, e.g.:
import warnings

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="Data Validation extension is not supported and will be removed",
    )
    data = pd.read_excel(f, sheet_name=None)
Note: sheet_name=None reads all the Excel sheets in one go too, returning a dict that maps each sheet name to its DataFrame.
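Putting it together, a minimal sketch of the loop from the question with the warning suppressed (files_list is the list from the question):

import warnings
import pandas as pd

for excel_file in files_list:
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="Data Validation extension is not supported and will be removed",
        )
        # dict mapping sheet name -> DataFrame, all six sheets in one read
        sheets = pd.read_excel(excel_file, sheet_name=None, engine="openpyxl")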
I have a 140MB Excel file I need to analyze using pandas. The problem is that if I open this file as xlsx, it takes Python 5 minutes simply to read it. I tried manually saving this file as csv, and then it takes Python about a second to open and read it! There are various 2012-2014 solutions out there, but they don't really work for me on Python 3.
Can somebody suggest how to convert very quickly file 'C:\master_file.xlsx' to 'C:\master_file.csv'?
There is a project called "rows" that aims to be very Pythonic about dealing with data. It relies on openpyxl for xlsx, though. I don't know if this will be faster than pandas, but anyway:
$ pip install rows openpyxl
And:
import rows
data = rows.import_from_xlsx("my_file.xlsx")
rows.export_to_csv(data, open("my_file.csv", "wb"))
I faced the same problem as you. Pandas and openpyxl didn't work for me.
I came across this solution and it worked great for me:
import win32com.client

xl = win32com.client.Dispatch("Excel.Application")
xl.DisplayAlerts = False
xl.Workbooks.Open(Filename=your_file_path, ReadOnly=1)
wb = xl.Workbooks(1)
wb.SaveAs(Filename='new_file.csv', FileFormat=6)  # 6 means csv (xlCSV)
wb.Close(False)
xl.Application.Quit()
wb = None
xl = None
Here you convert the file to csv by means of Excel itself. All the other ways that I tried refused to work.
Use read-only mode in openpyxl. Something like the following should work:

import csv
import openpyxl

wb = openpyxl.load_workbook("myfile.xlsx", read_only=True)
ws = wb['sheetname']
with open("myfile.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in ws.rows:
        values = (cell.value for cell in row)
        writer.writerow(values)
Fastest way that pops to mind:
pandas.read_excel
pandas.DataFrame.to_csv
As an added benefit, you'll be able to do cleanup of the data before saving it to csv.
import pandas as pd

df = pd.read_excel(r'C:\master_file.xlsx', header=0)  # optionally: sheet_name='<your sheet>'
df.to_csv(r'C:\master_file.csv', index=False, quotechar="'")
At some point, dealing with lots of data will take lots of time. Just a fact of life. Good to look for options if it's a problem, though.
I am trying to write a pandas DataFrame to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
The write syntax is:
df.to_parquet(path, mode='append')
and the read syntax is:
pd.read_parquet(path)
It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from the pandas docs:

DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)

We have to pass in both engine and **kwargs:

engine: {'auto', 'pyarrow', 'fastparquet'}
**kwargs: additional arguments passed to the parquet library.

The **kwargs we need to pass here is append=True (from fastparquet).
import pandas as pd
import os.path

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, then you will see the error below:

AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times appends the same two rows to the parquet file each time. If I inspect the metadata, I can see that this resulted in 3 row groups.
Note: append could be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression will work better, since compression operates within a row group only, and there will be less overhead spent on storing statistics, since each row group stores its own statistics.
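To verify, a quick sketch for counting the row groups (assumes fastparquet is installed; file_path as in the example above):

from fastparquet import ParquetFile

pf = ParquetFile(file_path)
print(len(pf.row_groups))  # one row group per append in this example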
To append, do this:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"

# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)

# Write direct to your parquet file
pq.write_to_dataset(table, root_path=output)
This will automatically append to your table.
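Each call writes a new file into the output directory. To read everything back as a single DataFrame (a sketch; output as above):

import pandas as pd

df = pd.read_parquet(output)  # reads every file in the dataset directory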
I used the AWS Wrangler library. It works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write to S3. I have not included the JSON processing logic, as this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler

import awswrangler as wr
import pandas as pd

event_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
                          columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
#print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as exc:  # avoid shadowing the column value e
    print(str(exc))
I'm happy to help with any clarifications. In a few other posts I have read the suggestion to read the data and overwrite it again, but as the data gets larger that slows down the process; it is inefficient.
There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting it.
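A minimal sketch of that read-concat-overwrite approach (the file name and the new_rows DataFrame are assumptions):

import pandas as pd

existing = pd.read_parquet("output.parquet")  # read what is already there
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_parquet("output.parquet")  # overwrite with the combined data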
Use the fastparquet write function:
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
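To handle the first write as well, one option (a sketch; file_name and df as above) is to only append when the file already exists:

import os.path
from fastparquet import write

write(file_name, df, append=os.path.isfile(file_name))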
pandas' to_parquet() can handle single files as well as directories with multiple files in them. pandas will silently overwrite the file if the file is already there. To append to a parquet object, just add a new file to the same parquet directory:
import datetime
import os

import pandas as pd

os.makedirs(path, exist_ok=True)

# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read
pd.read_parquet(path)
I have an .xlsx file with specific formatting and objects that I am using for reports that I plan on producing at a large scale using Python. I originally used openpyxl to load a copy of the template (openpyxl.load_workbook()), write a pandas DataFrame to the file (openpyxl.dataframe_to_rows()), then save the file for future distribution. I found out that openpyxl.load_workbook() does not load the formatting or objects, so they are removed from the new file. So I then tried xlrd to open the file (xlrd.open_workbook()), which loaded the formatting and objects properly. However, openpyxl will no longer write to the file, creating empty copies of the template file. Is there another package I can use that will handle the reading/writing by itself, or a package I can use instead of openpyxl? XlsxWriter didn't work either. See the code sample below.
from xlrd import open_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
import pandas as pd
import shutil

shutil.copy2('template.xlsx', 'new_report.xlsx')
book = open_workbook('new_report.xlsx')
writer = pd.ExcelWriter(book, engine='openpyxl')
ws = book.sheet_by_name('Sheet1')
for r in dataframe_to_rows(result, index=False, header=False):
    ws.cell(colx=1, rowx=1)
    ws.append(r)
book.save('new_report.xlsx')
I'm also getting the errors: "AttributeError: 'Book' object has no attribute 'save'" and "AttributeError: 'Sheet' object has no attribute 'append'" from the code if anyone has suggestions for those problems.
I ended up using formulas to recreate any formatting I had in the existing Excel file after pasting the new data. I'm still missing the objects (e.g. shapes), but my reports will live without them until I can find another workaround.
Is there a way to have pandas read in only the values from Excel and not the formulas? It reads the formulas in as NaN unless I go in and manually save the Excel file before running the code. I am just working with the basic read_excel function of pandas:
import pandas as pd

df = pd.read_excel(filename, sheet_name="Sheet1")
This will read the values if I have gone in and saved the file prior to running the code. But after running the code to update a new sheet, if I don't go in and save the file, running this again will read the formulas as NaN instead of just the values. Is there a workaround that anyone knows of that will just read values from Excel with pandas?
That is strange. The normal behaviour of pandas is to read values, not formulas. Likely, the problem is in your Excel files: probably your formulas point to other files, or they return a value that pandas sees as NaN.
In the first case, the sheet needs to be updated and there is nothing pandas can do about that (but read on).
In the second case, you could solve it by setting explicit NaN values in read_excel:

pd.read_excel(path, sheet_name="Sheet1", na_values=[your na identifiers])
As for the first case, and as a workaround solution to make your work easier, you can automate what you are doing by hand using xlwings:
import pandas as pd
import xlwings as xl

def df_from_excel(path):
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()  # re-save so the formula results are written into the file
    app.kill()
    return pd.read_excel(path)

df = df_from_excel(path_to_your_file)
If you want to keep those formulas in your Excel file, just save the file to a different location (book.save(different_location)). Then you can get rid of the temporary files with shutil.
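A sketch of that variant (the temporary path and function name are assumptions):

import os
import pandas as pd
import xlwings as xl

def df_from_excel_keep_formulas(path, tmp_path="tmp_copy.xlsx"):
    # Save a recalculated copy elsewhere so the original keeps its formulas.
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save(tmp_path)
    app.kill()
    df = pd.read_excel(tmp_path)
    os.remove(tmp_path)  # clean up the temporary copy
    return df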
I had this problem and I resolved it by moving a graph below the first row I was reading. It looks like the position of graphs may cause problems.
You can use xlrd to read the values. First you should refresh your Excel sheet, since you are also updating the values automatically with Python. You can use the function below:

import os
import xlrd
import win32com.client

file = "myxl.xls"

def refresh_file(file):
    xlapp = win32com.client.DispatchEx("Excel.Application")
    path = os.path.abspath(file)
    wb = xlapp.Workbooks.Open(path)
    wb.RefreshAll()
    xlapp.CalculateUntilAsyncQueriesDone()
    wb.Save()
    xlapp.Quit()
After the file refresh, you can start reading the content:
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_index(0)
for rowid in range(worksheet.nrows):
    row = worksheet.row(rowid)
    for colid, cell in enumerate(row):
        print(cell.value)
You can loop through the data however you need, and apply conditions while you are reading it, which gives you a lot more flexibility.
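For example, to keep only the non-empty cells of the first column (a sketch building on the loop above):

first_col = [
    worksheet.cell_value(rowid, 0)
    for rowid in range(worksheet.nrows)
    if worksheet.cell_value(rowid, 0) != ""
]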
I can't use the read_excel method from the pandas library in my IPython notebook.
After some testing and cleaning in the Excel file, I understood there is a complete column of drawings (or images). When I deleted this column, the error message stopped. Does somebody know how to configure read_excel options to collect only data? This is my code:
import pandas as pd
import os
# File selection
userfilepath = r'C:\Temp'
filename = "exportCS12.xlsx"
filenameCS12 = os.path.join(userfilepath, filename)
print(filenameCS12)
# workbook upload
df = pd.read_excel(filenameCS12, sheet_name='Sheet1')
The pandas import was not working due to an unclean Excel file. Problem solved with openpyxl, which is able to navigate the Excel file only in validated areas.
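A sketch of that approach (the sheet name and header handling are assumptions; data_only=True makes openpyxl return cached cell values and skip formulas, and drawings are not loaded as cell data):

import pandas as pd
from openpyxl import load_workbook

wb = load_workbook(filenameCS12, data_only=True)  # values only, drawings are ignored
ws = wb["Sheet1"]
rows = list(ws.values)  # tuples of plain cell values
df = pd.DataFrame(rows[1:], columns=rows[0])  # first row as the header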