Reading excel file from zip archive using python and openpyxl

I have a password-protected zip archive containing some Excel spreadsheets with confidential data. I'd like to take the password from the user, open the zip, and analyze the spreadsheet inside without actually extracting the Excel file. Is it possible to construct openpyxl's workbook directly from the zip entry and analyze the data in it? I want to avoid extracting the Excel file to the file system because of potential security problems (e.g. undeleted temp files).
Is it possible, for example, to treat the zip archive as a pseudo file system?
Thanks in advance!
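
A minimal sketch of one way this might be approached, assuming the archive uses the legacy zip encryption that the standard zipfile module can decrypt (it cannot read AES-encrypted archives) and that openpyxl is installed. The function names and file names here are hypothetical, not taken from the question. The entry is decrypted into an in-memory buffer and handed to openpyxl, so nothing is written to disk by this code.

```python
import io
import zipfile


def read_member_to_memory(zip_path, member_name, password):
    """Return one archive member as an in-memory, seekable file object."""
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.read() decrypts into memory; pwd must be bytes, not str.
        data = archive.read(member_name, pwd=password.encode("utf-8"))
    return io.BytesIO(data)


def load_workbook_from_zip(zip_path, member_name, password):
    """Build an openpyxl workbook from the zip entry without extracting it."""
    import openpyxl  # third-party; imported lazily so the helper above is stdlib-only

    buffer = read_member_to_memory(zip_path, member_name, password)
    # load_workbook accepts a file-like object as well as a path.
    return openpyxl.load_workbook(buffer)
```

The password could be collected with getpass.getpass() and passed in, e.g. load_workbook_from_zip("archive.zip", "report.xlsx", getpass.getpass()), where both file names are placeholders.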

Related

Python - Opening a Password Protected CSV to read/write

I have been using Python for only about two months, so I am still quite new to coding.
Recently, at work, I wrote some code which opens an existing CSV file, performs a few operations and writes out a new CSV file. That bit I am happy with.
What I want to know is how I can secure the document and still run the code that opens it. For example, I want to password-protect this CSV file and prompt the user for the password, which would be the only way to open or read the file.
Can anyone point me in the right direction please?

A CSV file is simply formatted text, so there are a few ways to protect it:
Save the CSV file, then change its permissions using OS commands (see "Password Protecting Excel file using Python").
Zip the CSV file with a password. Unfortunately, the built-in zipfile module does not support writing password-encrypted files (only reading), so you could use pyminizip instead. Refer to "Create password protected zip file Python".
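
Since zipfile can read password-protected archives, the reading side might look like the sketch below. It assumes the archive was created by some other tool with legacy zip encryption and that the CSV is UTF-8; read_csv_from_zip and its parameters are hypothetical names.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_path, member_name, password):
    """Return the rows of a CSV stored inside a (possibly encrypted) zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # pwd must be bytes; it is only used if the entry is encrypted.
        with archive.open(member_name, pwd=password.encode("utf-8")) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Read everything before the archive is closed.
            return list(csv.reader(text))
```

The password can be requested from the user with getpass.getpass("Password: ") so that it is not echoed to the terminal.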

Rename multiple files using a list of names on excel

I have a bunch of PDF files with random names, like 95456356.pdf, 7896548965.pdf and so on. I also have an Excel file with a list of names in one column. Can I write code that would read that Excel file and then rename all the PDF files in the same order, so that the first file is renamed to the name in the first row of the Excel file? I can copy and paste that column to a .txt file if that makes it easier.

You can use the xlrd package to read your Excel file. A nice tutorial on how to read Excel files can be found here: https://www.geeksforgeeks.org/reading-excel-file-using-python/
With this package it should be relatively easy to write code with the desired behaviour:
Read each cell containing a name you want to use for renaming the files.
Read the filenames of the PDF files.
Rename the files using the os package (see here: How to rename a file using Python).
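
The steps above could be sketched roughly as follows. This version takes the names as a plain list (for example read from the .txt file the question mentions) rather than through xlrd, and it uses pathlib instead of os. It assumes "the same order" means the sorted order of the PDF filenames and that none of the new names clashes with an existing file. rename_pdfs is a hypothetical helper.

```python
from pathlib import Path


def rename_pdfs(folder, new_names):
    """Rename the PDFs in folder, taken in sorted order, to the given names."""
    pdfs = sorted(Path(folder).glob("*.pdf"))
    if len(pdfs) != len(new_names):
        raise ValueError("number of PDFs and number of names differ")
    renamed = []
    for pdf, name in zip(pdfs, new_names):
        target = pdf.with_name(name + ".pdf")
        pdf.rename(target)  # the file stays in the same folder
        renamed.append(target)
    return renamed
```

With the names pasted into a text file, one per line, the list could be built with Path("names.txt").read_text().splitlines(), where names.txt is a placeholder.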

Modifying and writing data in an existing excel file using Python

I have an Excel file (xlsx) that already has lots of data in it. Now I am trying to use Python to write new data into this Excel file. I looked at xlwt, xlrd, xlutils, and openpyxl; all of these modules require you to load the data of the Excel sheet, apply changes, and save to a new Excel file. Is there any way to just change the data in the existing Excel sheet rather than loading the workbook or saving to new files?

This is not possible because XLSX files are zip archives and cannot be modified in place. In theory it might be possible to edit only a part of the archive that makes up an OOXML package, but in practice this is almost impossible because relevant data may be spread across different files.
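
To see how the data is spread out, the parts of a package can be listed with the standard zipfile module. This is only an illustrative sketch and list_package_parts is a hypothetical name. In a typical workbook the sheets live under xl/worksheets/ while cell strings are usually kept in a separate xl/sharedStrings.xml part.

```python
import zipfile


def list_package_parts(xlsx_source):
    """Return the names of the files (parts) stored inside an .xlsx package.

    xlsx_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx_source) as package:
        return package.namelist()
```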

Please check openpyxl once again. You can load the data, work on it in Python, write your results to a new sheet or to an existing sheet in the same file, and save it (everything happens in memory).
For example:

    import openpyxl

    # Load the data. Note: data_only=True reads cached values instead of
    # formulas, so saving this workbook afterwards drops the formulas.
    wb = openpyxl.load_workbook("file.xlsx", data_only=True)

    # Manipulate the data with Python
    # ...

    # Create a sheet (appended at the end by default)
    some_sheet = wb.create_sheet("someSheet")

    # Write to the sheet
    # ...

    # Save the file (close it in Excel first, otherwise saving raises PermissionError)
    wb.save("file.xlsx")

Here is the link:
https://openpyxl.readthedocs.io/en/default/tutorial.html

Why does zipfile.is_zipfile returns True on xlsx file?

I am using is_zipfile to check whether something is a zip file before extracting it, but the method returns True for an Excel file held in a StringIO object. I am using Python 2.7. Does anyone know how to fix this? Is is_zipfile reliable? Thanks.

Quoting from Microsoft's XLSX Structure overview doc:
"Workbook data is contained in a ZIP package conforming to the Open Packaging Conventions"
So .xlsx files really are zip files. If you don't want to treat them as zip files, you may have to exclude them with a condition like this:
if os.path.splitext(filename)[1] != ".xlsx" and zipfile.is_zipfile(filename):

This is because xlsx is actually a valid zip file.
See also:
Office Open XML
The Microsoft Office XML formats
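
When there is no filename to inspect (as with an in-memory object), another option is to look inside the archive. The sketch below relies on the fact that packages following the Open Packaging Conventions contain a [Content_Types].xml part at the root. Note that this also matches .docx and .pptx files, and that an ordinary zip could in principle contain a file with that name. It is written for Python 3, and the function names are hypothetical.

```python
import zipfile


def looks_like_ooxml(source):
    """True if source is a zip that carries the OPC content-types part."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        return "[Content_Types].xml" in archive.namelist()


def is_plain_zip(source):
    """True for zip archives that do not look like an Office package."""
    return zipfile.is_zipfile(source) and not looks_like_ooxml(source)
```

Both functions accept either a path or a binary file-like object such as io.BytesIO.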

Combine tab-separated value (TSV) files into an Excel 2007 (XLSX) spreadsheet

I need to combine several tab-separated value (TSV) files into an Excel 2007 (XLSX) spreadsheet, preferably using Python. There is not much cleverness needed in combining them - just copying each TSV file onto a separate sheet in Excel will do. Of course, the data needs to be split into columns and rows same as Excel does when I manually copy-paste the data into the UI.
I've had a look at the raw XML file Excel 2007 generates and it's huge and complex, so writing that from scratch doesn't seem realistic. Are there any libraries available for this?

Looks like xlwt may serve your needs -- you can read each TSV file with Python's standard library csv module (which DOES do tab-separated as well as comma-separated etc, don't worry!-) and use xlwt (maybe via this cheatsheet;-) to create an XLS file, make sheets in it, build each sheet from the data you read via csv, etc. Not sure about XLSX vs plain XLS support but maybe the XLS might be enough...?

The best Python module for directly creating Excel files is xlwt, but it doesn't support XLSX.
As I see it, your options are:
1. If you only have "several", you could just do it by hand.
2. Use pythonwin to control Excel through COM. This requires you to run the code on a Windows machine with Excel 2007 installed.
3. Use Python to do some preprocessing on the TSV to produce a format that will make step (1) easier. I'm not sure if Excel reads TSV, but it will certainly read CSV files directly.

Note that Excel 2007 will quite happily read "legacy" XLS files (those written by Excel 97-2003 and by xlwt). You need XLSX files because .....?
If you want to go with the defaults that Excel will choose when deciding whether each piece of your data is a number, a date, or some text, use pythonwin to drive Excel 2007. If the data is in a fixed layout such that other than a possible heading row, each column contains data that is all of one known type, consider using xlwt.
You may wish to approach xlwt via http://www.python-excel.org which contains an up-to-date tutorial for xlrd, xlwt, and xlutils.
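
For completeness, here is a rough sketch that reads each TSV with the csv module, as suggested above, but writes the sheets with openpyxl instead of xlwt, since openpyxl can produce XLSX. openpyxl is not mentioned in the answers to this question and is assumed to be installed; the function names are hypothetical. Every value is written as text, so any number or date detection would have to be added separately.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Return the rows of a tab-separated file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def tsv_files_to_xlsx(tsv_paths, xlsx_path):
    """Copy each TSV file onto its own sheet of a new workbook."""
    import openpyxl  # third-party; an assumption, not part of the answers above

    workbook = openpyxl.Workbook()
    workbook.remove(workbook.active)  # drop the default empty sheet
    for tsv_path in tsv_paths:
        # Sheet titles are limited to 31 characters.
        sheet = workbook.create_sheet(title=Path(tsv_path).stem[:31])
        for row in read_tsv(tsv_path):
            sheet.append(row)  # one list per spreadsheet row
    workbook.save(xlsx_path)
```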

We Keep Coding
