Reading extracted XLSX files with OpenPyXL

Reading extracted XLSX files with OpenPyXL - python

So I've been using Python 3.2, and OpenPyXL's iterable workbook as demonstrated here in the "Optimized Reader" example.
My problem arises when I try to use this strategy to read a file or files that I've extracted from a simple .zip archive (both manually and through the python zipfile package). When I call .get_highest_column() I get "A" and .get_highest_row() I get 1, and when asked to print each cell's value as shown here:
wb = load_workbook(filename = file_name, use_iterators = True)
ws = wb.worksheets[0] # Only need to read the first sheet, nothing fancy
for row in ws.iter_rows():
for entry in row:
print(entry.internal_value)
It prints the values in A1, A2, A3, A4, A5, A6, and A7, regardless of how large the file actually is. There isn't any reason for this in the file itself, and it will open in Excel perfectly fine. I'm quite stumped as to why it does it like this, but I assume that the unzipped XLSX is formatted differently prior to being saved from within Excel, and OpenPyXL cannot interpret it correctly. I even renamed the '.xlsx' to '.zip' so that I could explore the file and examine the differences, but couldn't tell much except that the one saved from Excel also has a subfolder called "theme" within the "xl" folder that the previous version does not, with font and formatting data.
IMPORTANT NOTE: When I open it and re-save it with the same filename from within Excel and then run this bit of code, it works perfectly - returns correct greatest row and column values, and correctly prints every cell value. I've tried instead saving the workbook through OpenPyXL immediately after opening it, but this yields the same erroneous results.
Basically, I need to discover a method to properly extract a .xlsx file from a .zip file so that it can be read with OpenPyXL. There are many many files that need to be processed like this, so it must be external to Excel, and hopefully as efficient as possible.
Cheers!

It sounds like this has nothing to do with the extraction from the zipfile, as the problem also occurs if you manually extract the files.
I would try to store the files opened and saved with Excel in a zipfile and see what happens. If that works, then clearly the way the original .xlsx files were generated is the problem.
I strongly suspect that to be the case.
If that is the problem, see if you can extract the .xlsx files (they are zipfiles themselves) and compare the one you re-saved with Excel to the original problematic one. xml does not compare easily as Excel can rearrange most things at will, but you might be able to do a diff.

Related

How can I iterate through Excel Worksheets that aren't explicitly named using XlsxWriter?

I'm a total novice when it comes to programming. I'm trying to write a Python 3 program that will produce an Excel workbook based on the contents of a CSV file. So far, I understand how to create the workbook, and I'm able to dynamically create worksheets based on the contents of the CSV file, but I'm having trouble writing to each individual worksheet.
Note, in the example that follows, I'm providing a static list, but my program dynamically creates a list of names based on the contents of the CSV file: the number of names that will be appended to the list varies from 1 to 60, depending on the assay in question.
import xlsxwriter
workbook = xlsxwriter.Workbook('C:\\Users\\Jabocus\\Desktop\\Workbook.xlsx')
list = ["a", "b", "c", "d"]
for x in list:
worksheet = workbook.add_worksheet(x)
worksheet.write("A1", "Hello!")
workbook.close()
If I run the program as it appears above, I get a SyntaxError, and IPython points to workbook.close() as the source of the problem.
However, if I exclude the line where I try to write "Hello!" to cell A1 in every worksheet, the program runs as I'd expect: I end up with Workbook.xlsx on my desktop, and it has 4 worksheets named a, b, c, and d.
The for loop seemed like a good choice to me, because my program will need to handle a variety of CSV formats (I'd rather write one program that can process data from every assay at my lab than a program for each assay).
My hope was that by using worksheet.write() in the way that I did, Python would know that I want to write to the worksheet that I just created (i.e. I thought worksheet would act as the name for each worksheet during each iteration of the loop despite explicitly naming each worksheet something new).
It feels like the iteration is the problem here, and I know that it has something to do with how I'm trying to reference each Worksheet in the write() step (because I'm not giving any of the Worksheet objects an explicit name), but I don't know how to proceed. What's a good way that I might approach this problem?

I'm not sure exactly what is wrong with your code, but I can tell you this:
I copied your code exactly (except for changing the path to be my desktop) and it worked fine.
I believe your issue could be one of three things:
You have a buggy/old version of XlsxWriter
You have a file called Workbook.xlsx on your Desktop already that is corrupted or causing some issues (open in another program.)
You have some code other than what you posted.
To account for all of these possibilities, I would recommend that you:
Reinstall XlsxWriter:
In a Command Prompt run pip uninstall XlsxWriter followed by pip install XlsxWriter
Change the filename of the workbook you are opening:
workbook = xlsxwriter.Workbook('C:\\Users\\Jabocus\\Desktop\\Workbook2.xlsx')
Try running the code that you posted exactly, then incrementally add to it until it stops working.

Did you try something like worksheet.write(0, 0, 'Hello')
instead of worksheet.write('A1', 'Hello')

Python Openpyxl VLOOKUP from other file

I have 2 files, lets say 't1.xlsx' and 't2.xlsx'.
What i want to do is to do the VLOOKUP fucntion inside the t1 file using the data from t2 file.
I try to paste
"sheet["O2"].value = "=VLOOKUP(C:C;'C:\\Users\\KKK\\Desktop\\sheets\\excellent\\
[t2.xlsx]baza'!$A$2:$AI$10480;25;0)"
where baza is a sheet name, but sadly when i try open the file it says it can not be open due to the error and offers me repairing tool.
rest of the code:
import openpyxl
wb = openpyxl.load_workbook('t1.xlsx')
sheets = wb.get_sheet_names()
sheet = wb.get_sheet_by_name('Sheet1')
[VLOOKUP STUFF FROM BEFORE]
wb.save("t1.xlsx")

With more complicated formulae you should always check the syntax in the XML because they are often stored differently than they appear in Excel. This is covered in the documentation. You might be okay simply using a comma as a separator but I suspect you'll also have change the path of the file and use a Python raw string (the r prefix).

How do I increase the default column width of a csv file so that when I open the file all of the text fits correctly?

I am trying to code a function where I grab data from my database, which already works correctly.
This is my code for the headers prior to adding the actual records:
with open('csv_template.csv', 'a') as template_file:
#declares the variable template_writer ready for appending
template_writer = csv.writer(template_file, delimiter=',')
#appends the column names of the excel table prior to adding the actual physical data
template_writer.writerow(['Arrangement_ID','Quantity','Cost'])
#closes the file after appending
template_file.close()
This is my code for the records which is contained in a while loop and is the main reason that the two scripts are kept separate.
with open('csv_template.csv', 'a') as template_file:
#declares the variable template_writer ready for appending
template_writer = csv.writer(template_file, delimiter=',')
#appends the data of the current fetched values of the sql statement within the while loop to the csv file
template_writer.writerow([transactionWordData[0],transactionWordData[1],transactionWordData[2]])
#closes the file after appending
template_file.close()
Now once I have got this data ready for excel, I run the file in excel and I would like it to be in a format where I can print immediately, however, when I do print the column width of the excel cells is too small and leads to it being cut off during printing.
I have tried altering the default column width within excel and hoping that it would keep that format permanently but that doesn't seem to be the case and every time that I re-open the csv file in excel it seems to reset completely back to the default column width.
Here is my code for opening the csv file in excel using python and the comment is the actual code I want to use when I can actually format the spreadsheet ready for printing.
#finds the os path of the csv file depending where it is in the file directories
file_path = os.path.abspath("csv_template.csv")
#opens the csv file in excel ready to print
os.startfile(file_path)
#os.startfile(file_path, 'print')
If anyone has any solutions to this or ideas please let me know.

Unfortunately I don't think this is possible for CSV file formats, since they are just plaintext comma separated values and don't support formatting.
I have tried altering the default column width within excel but every time that I re-open the csv file in excel it seems to reset back to the default column width.
If you save the file to an excel format once you have edited it that should solve this problem.
Alternatively, instead of using the csv library you could use xlsxwriter instead which does allow you to set the width of the columns in your code.
See https://xlsxwriter.readthedocs.io and https://xlsxwriter.readthedocs.io/worksheet.html#worksheet-set-column.
Hope this helps!

The csv format is nothing else than a text file, where the lines follow a given pattern, that is, a fixed number of fields (your data) delimited by comma. In contrast an .xlsx file is a binary file that contains specifications about the format. Therefore you may want write to an Excel file instead using the rich pandas library.

You can add space like as it is string so it will automatically adjust the width do it like this:
template_writer.writerow(['Arrangement_ID ','Quantity ','Cost '])

Duplicating PDF file by variable

I'm working on a project and I've ran into a brick wall.
We have a text file which has a numeric value and pdf file name. If the value is 6 we need to print 6 copies of the PDF. I was thinking of creating X copies of the PDF per line then combining them after. I'm not sure this is the most efficient way to do it but was wondering if anyone else has another idea.
DATA
1,PDF1
2,PDF5
7,PDF2
923,PDF33

You should be using the python CSV module to read in your data into two variables, numCopies and filePath https://docs.python.org/2/library/csv.html
You can then just use
for i in range(1, numCopies):
shutil.copyfile(filePath, newFilePath)
or something along those lines.
If you want to physically work with the PDF files I'd recommend the pyPdf module.

How do i extract specific lines of data from a huge Excel sheet using Python?

I need to get specific lines of data that have certain key words in them (names) and write them to another file. The starting file is a 1.5 GB Excel file. I can't just open it up and save it as a different format. How should I handle this using python?

I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]
How big is the file in MB? ["Huge" is not a useful answer]
What software created the file?
How much memory do you have on your computer?
Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".
Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message
What operating system, what version of Python, what version of xlrd?
Do you know how many worksheets there are in the file?

It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.
Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.
If the workbook is just bigger in dimension then xlrd should work for reading the file, but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here since you can open the file with EditPad lite) then you would have to find an alternate method because xlrd reads the entire workbook into memory.
Assuming the first case:
import xlrd
wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'
wb = xlrd.open(wb_path)
ws = wb.sheets()[0] # assuming you want to work with the first sheet in the workbook
with open(output_path, 'w') as output_file:
for i in xrange(ws.nrows):
row = [cell.value for cell in ws.row(i)]
# ... replace the following if statement with your own conditions ...
if row[0] == u'interesting':
output_file.write('\t'.join(row) + '\r\n')
This will give you a tab-delimited output file that should open in Excel.
Edit:
Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.

I haven't used it, but xlrd looks like it does a good job reading Excel data.

Your problem is that you are using Excel 2003 .. You need to use a more recent version to be able to read this file. 2003 will not open files bigger than 1M rows.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.