I have used openpyxl with Workbook(write_only=True) to create large Excel xlsx files. In this mode I cannot format the Excel headers, so I save the xlsx, open it again with openpyxl load_workbook(my_book), format the cells, and save the file again. If the file isn't too large it saves; otherwise I get a memory error.
So openpyxl lets me create and save the worksheet, but not necessarily re-open and save the same worksheet.
In this example I just call load_workbook and then save, without changing the xlsx, to show the error:
import logging
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill
wb = Workbook(write_only=True)
# then append a lot of rows
logging.info('Save unformatted xlsx')
wb.save(my_book)
workbook = load_workbook(my_book)
# the worksheet always loads ok at this point, even when 700,000 rows
workbook.save(my_book)
## Immediately after load_workbook I do workbook.save(my_book).
With around 8,600 rows there is no problem. With 350,000 rows there is a memory error:
File "src\lxml\serializer.pxi", line 1268, in lxml.etree._IncrementalFileWriter._handle_error
File "src\lxml\etree.pyx", line 316, in lxml.etree._ExceptionContext._raise_if_stored
File "src\lxml\serializer.pxi", line 650, in lxml.etree._FilelikeWriter.write
MemoryError
logging.info('Saved unformatted xlsx immediately after opening again')
workbook = load_workbook(my_book)
# If no error I do some formatting and all is well and can save ok
Python 3.4.3
openpyxl (2.5.1)
lxml (4.2.1)
There are a lot of solutions for older versions but I cannot see any for openpyxl (2.5.1).
Does anyone have an answer for openpyxl, or can recommend what to use to open an existing large xlsx and format cells?
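One way to avoid the reload step entirely might be to style the header while still in write-only mode. The openpyxl documentation describes WriteOnlyCell for this purpose. The sketch below is untested against 2.5.1 specifically, and the import path may differ in older releases. The function name, sheet title, and colours are my own placeholders.

```python
from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
from openpyxl.styles import Font, PatternFill


def write_with_styled_header(path, header, rows):
    """Stream rows to an xlsx file, styling the header as it is written."""
    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title="data")  # sheet title is an assumption

    bold = Font(bold=True)
    fill = PatternFill(fill_type="solid", fgColor="DDDDDD")

    # Wrap each header value in a WriteOnlyCell so it can carry a style.
    styled = []
    for title in header:
        cell = WriteOnlyCell(ws, value=title)
        cell.font = bold
        cell.fill = fill
        styled.append(cell)
    ws.append(styled)

    # Data rows are appended as plain values, one row at a time.
    for row in rows:
        ws.append(row)

    wb.save(path)
```

Because the workbook is never loaded back in full, the memory cost should stay close to that of the original write-only export.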
It wasn't off topic; it was an error, even if just a memory error. Anyway, in the end I exported to CSV rather than XLSX, as Excel still opens it nicely. It was not worth the bother.
Related
I download an XLS file from the web using Selenium.
I tried many options I found on Stack Overflow and other websites to read the XLS file:
import pandas as pd
df = pd.read_excel('test.xls') # Read XLS file
Expected "little-endian" marker, found b'\xff\xfe'
And
df = pd.ExcelFile('test.xls').parse('Sheet1') # Read XLS file
Expected "little-endian" marker, found b'\xff\xfe'
And again
from xlrd import open_workbook
book = open_workbook('test.xls')
CompDocError: Expected "little-endian" marker, found b'\xff\xfe'
I have tried different encodings: utf-8, ANSI, utf_16_be, utf16.
I have even tried to get the encoding of the file from notepad or other applications.
Type of file : Microsoft Excel 97-2003 Worksheet (.xls)
I can open the file with Excel without any issue.
What's frustrating is that if I open the file with Excel and just press save, I can then read the file with any of the previous Python commands.
I would be really grateful if someone could provide me other ideas I could try. I need to open this file with a python script only.
Thanks,
Max
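My reading of the error text (an assumption about xlrd's behaviour, not something confirmed in this thread) is that the file was recognised as an OLE2 compound document, but the two-byte byte-order field at offset 28 of the header holds FF FE rather than the FE FF that the format expects. A small standard-library sketch to inspect the header before handing the file to a reader; the function names are made up for illustration:

```python
CFB_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file magic
EXPECTED_BYTE_ORDER = b"\xfe\xff"  # 0xFFFE stored little-endian


def describe_xls_header(header):
    """Classify the first bytes of a file that claims to be .xls."""
    if header[:4] == b"PK\x03\x04":
        return "zip container (probably xlsx, not xls)"
    if header[:8] != CFB_SIGNATURE:
        return "not an OLE2 compound file (maybe HTML or text)"
    if header[28:30] != EXPECTED_BYTE_ORDER:
        return "OLE2 file with unexpected byte-order field %r" % header[28:30]
    return "OLE2 file with a normal header"


def describe_xls_file(path):
    """Read only the first 30 bytes of the file and classify them."""
    with open(path, "rb") as fh:
        return describe_xls_header(fh.read(30))
```

If this reports an unexpected byte-order field, that would be consistent with Excel tolerating the file and rewriting a clean header when it saves.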
Solution (somewhat messy but simple) that could potentially work for any type of Excel file:
Call Excel from Python (via COM automation) to open and save the file. Excel "cleans up" the file, and Python can then read it with any Excel-reading function.
Solution inspired by the comments from @Serge Ballesta and @John Y.
## Open a file in Excel and save it to correct the encoding error
import win32com.client
import pandas
downloadpath="c:\\firefox_downloads\\"
filename="myfile.xls"
xl=win32com.client.Dispatch("Excel.Application")
xl.Application.DisplayAlerts = False # disables Excel pop up message (for saving the file)
wb = xl.Workbooks.Open(Filename=downloadpath+filename)
wb.SaveAs(downloadpath+filename)
wb.Close() # close the workbook after saving
xl.Application.DisplayAlerts = True # enables Excel pop up message for saving the file
df = pandas.ExcelFile(downloadpath+filename).parse('Sheet1') # Read the re-saved XLS file
Thank you all!
What does pd mean?
pandas is made for data science. In my opinion, you should use openpyxl (reads and writes only xlsx) or xlrd/xlwt (xlrd reads xls, xlwt writes only xls).
from xlrd import open_workbook
book = open_workbook(<math file>)
sheet =....
There are several examples of this on the Internet.
I have scripted code for writing pandas df into excel file with openpyxl. See Fill in pd data frame into existing excel sheet (using openpyxl v2.3.2).
from openpyxl import load_workbook
import pandas as pd
import numpy as np
book=load_workbook("excel_proc.xlsx")
writer=pd.ExcelWriter("excel_proc.xlsx", engine="openpyxl")
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
data_df.to_excel(writer, sheet_name="example", startrow=100, startcol=5, index=False)
writer.save()
That procedure works fine. However, when opened, each resulting Excel file reports that it is corrupted because some content is not readable. Excel can repair it and save it again, but this has to be done manually. Since I have to process many files, how can I solve or circumvent that?
Alternatively, how do I have to change the code to use "xlsxwriter" instead of "openpyxl"?
When I just exchange "engine="openpyxl"" with "engine="xlsxwriter"", Python tells me that "'Worksheet' object has no attribute 'write'" at the data_df.to_excel line.
Addition: Excel tells me that "removed records named range of /xl/workbook.xml part" is the corruption that has to be repaired. I do not know what that means.
I think you'll have to use openpyxl, because xlsxwriter doesn't yet support modifying existing Excel XLSX files.
From docs:
It cannot read or modify existing Excel XLSX files.
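If staying with openpyxl, one option is to skip the ExcelWriter internals and write the values directly with ws.cell. This is only a sketch: write_block is a name I made up, and whether it avoids the repair prompt depends on how openpyxl handles the named ranges in the file, which this does not address. Note that openpyxl rows and columns are 1-based, so pandas' startrow=100, startcol=5 corresponds to start_row=101, start_col=6 here.

```python
from openpyxl import load_workbook


def write_block(path, sheet_name, header, rows, start_row=1, start_col=1):
    """Write a header plus rows into an existing workbook at a 1-based offset."""
    wb = load_workbook(path)
    # Reuse the sheet if it exists, otherwise create it.
    if sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
    else:
        ws = wb.create_sheet(sheet_name)
    for r, values in enumerate([header] + list(rows)):
        for c, value in enumerate(values):
            ws.cell(row=start_row + r, column=start_col + c, value=value)
    wb.save(path)
```

For a DataFrame, header could be list(data_df.columns) and rows could be data_df.values.tolist().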
I'm new to Python, and I tried to write data into an existing Excel file using openpyxl. My Excel file has somewhat complicated formatting. After running my simple code, I checked the file and all the formatting was corrupted.
This is my code:
import openpyxl
xfile = openpyxl.load_workbook('sample.xlsx')
sheet = xfile.get_sheet_by_name('Sheet1')
sheet['A1'] = 'hello world'
xfile.save('sample.xlsx')
Please help me figure out how to fix it, or suggest an alternative library that works well with xlsx files.
I am having problems appending data to an xls file.
Long story short, I am using a program to get some data from something and writing it in an xls file.
If I run the script 10 times, I would like the results to be appended to the same xls file.
My problem is that I am forced to use Python 3.4 and xlutils is not supported, so I cannot use the copy function.
I just have to use xlwt / xlrd. Note, the file cannot be a xlsx.
Is there any way I can do this?
I would look into using openpyxl, which is supported by Python 3.4. An example of appending to a file can be found at https://openpyxl.readthedocs.org/en/default/. Please also see: How to append to an existing excel sheet with XLWT in Python. Here is an example that will do it, assuming you have an Excel sheet called sample.xlsx:
from openpyxl import Workbook, load_workbook
# grab the active worksheet
wb = load_workbook("sample.xlsx")
ws = wb.active
ws.append([3])
# Save the file
wb.save("sample.xlsx")
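To cover the "run the script 10 times" case, the same idea can be wrapped so that the first run creates the workbook and later runs append to it. This is a sketch with a hypothetical helper name (append_result), and like the example above it writes xlsx rather than xls.

```python
import os

from openpyxl import Workbook, load_workbook


def append_result(path, row, header=None):
    """Append one row to the workbook at path, creating it on the first run."""
    if os.path.exists(path):
        wb = load_workbook(path)
        ws = wb.active
    else:
        wb = Workbook()
        ws = wb.active
        if header:
            ws.append(header)  # written once, when the file is created
    ws.append(row)  # lands on the first empty row below existing data
    wb.save(path)
```

Each call reloads and rewrites the whole file, so this suits modest result files rather than very large ones.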
I'm reading an existing Excel file with the openpyxl package and then trying to save it. The file gets saved, but when I open it afterwards no data is present. I used the following code; my requirement is to open the file in use_iterators = True mode only:
from openpyxl import load_workbook
wb = load_workbook(filename = 'large_file.xlsx', use_iterators = True)
ws = wb.get_sheet_by_name(name = 'big_data')
for row in ws.iter_rows():
    for cell in row:
        print cell.internal_value
wb.save("large_file.xlsx")
Can you show how to save and close the file after saving without losing the data?
Try loading with use_iterators = False, as use_iterators = True loads the data differently, so the workbook may not contain all the information you wish to save.
Openpyxl writes an entirely new Excel file based on the information it has read in, so it's not as if you make a small change and just update the file. (This also means that if certain features aren't supported in openpyxl, such as VB macros, they won't exist in the file you've saved.)
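If the file is too large to load in full, a possible workaround is to stream it: read in read-only mode and write each row into a new write-only workbook. As far as I know, read_only=True is the name later openpyxl releases use for what was use_iterators=True. The sketch below copies cell values only (no styles, macros, or other features), and stream_copy is a hypothetical name. It writes to a different path, since the source stays open while it is being read.

```python
from openpyxl import Workbook, load_workbook


def stream_copy(src_path, dst_path):
    """Copy cell values sheet by sheet without loading a whole sheet at once."""
    src = load_workbook(src_path, read_only=True)
    dst = Workbook(write_only=True)
    for ws in src.worksheets:
        out = dst.create_sheet(title=ws.title)
        for row in ws.iter_rows():
            # Rows could be transformed here before being written out.
            out.append([cell.value for cell in row])
    dst.save(dst_path)
    src.close()  # read-only workbooks keep the source file open until closed
```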