Prevent formulas added to an openpyxl workbook from executing - python

When using openpyxl to create spreadsheets based on untrusted input (for example, data exports from a web application for admin analysis), formulas can be a vector for script injection. If Excel executes malicious formulas in a spreadsheet, they can take over the admin's machine or exfiltrate data.
For example, this simple workbook adds a formula:
from openpyxl import Workbook
wb = Workbook()
ws = wb.active  # active is a property, not a method
ws.append(["=1 + 2"])
wb.save(filename='/tmp/formula.xlsx')  # save() is a workbook method
When opening /tmp/formula.xlsx in Excel, the formula is executed. =1 + 2 is benign, but it could also be something more malicious like =2+5+cmd|' /C calc'!A0. [reference]
How can I write data to a worksheet to ensure that it is not interpreted as a formula? It would be convenient to retain formatting for non-executable data like dates and numbers, rather than coercing everything to strings.

You're right that code injection is a risk, though it's arguably Excel's job to sandbox here and if you're worried about this then you really ought to think about additional protections.
We do expose the calculation node of the workbook settings, so I think setting wb.calculation.fullCalcOnLoad = False might do what you need. But you'll probably need to read the specification to be certain.

I had this issue recently. I went with a hack which adds a tab character to the start of a string value if it starts with an =. Something like this:
if value and value[0] == '=':
    value = "\t" + value
Another method would be to use cell.set_explicit_value:
ws['A1'].set_explicit_value(value, data_type="s")  # ws is the worksheet, not the workbook
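The prefix approach described above can be sketched as a small helper that only touches strings, so numbers and dates keep their types. This is a minimal sketch: the names `neutralize_formula` and `neutralize_row`, and the configurable `triggers` tuple, are hypothetical and not part of openpyxl.

```python
from datetime import date


def neutralize_formula(value, triggers=("=",), prefix="\t"):
    """Return value unchanged unless it is a string starting with a trigger character."""
    if isinstance(value, str) and value.startswith(triggers):
        # Prepend a tab so the spreadsheet treats the cell as plain text
        return prefix + value
    return value


def neutralize_row(row):
    """Apply neutralize_formula to every value in a row."""
    return [neutralize_formula(v) for v in row]


# Only the formula-like string is changed; the number and date pass through as-is
row = neutralize_row(["=1 + 2", 42, date(2018, 4, 17), "plain text"])
```

The resulting list could then be passed to ws.append(row) in place of the raw values.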

Related

How can I iterate through Excel Worksheets that aren't explicitly named using XlsxWriter?

I'm a total novice when it comes to programming. I'm trying to write a Python 3 program that will produce an Excel workbook based on the contents of a CSV file. So far, I understand how to create the workbook, and I'm able to dynamically create worksheets based on the contents of the CSV file, but I'm having trouble writing to each individual worksheet.
Note, in the example that follows, I'm providing a static list, but my program dynamically creates a list of names based on the contents of the CSV file: the number of names that will be appended to the list varies from 1 to 60, depending on the assay in question.
import xlsxwriter
workbook = xlsxwriter.Workbook('C:\\Users\\Jabocus\\Desktop\\Workbook.xlsx')
list = ["a", "b", "c", "d"]
for x in list:
    worksheet = workbook.add_worksheet(x)
    worksheet.write("A1", "Hello!")
workbook.close()
If I run the program as it appears above, I get a SyntaxError, and IPython points to workbook.close() as the source of the problem.
However, if I exclude the line where I try to write "Hello!" to cell A1 in every worksheet, the program runs as I'd expect: I end up with Workbook.xlsx on my desktop, and it has 4 worksheets named a, b, c, and d.
The for loop seemed like a good choice to me, because my program will need to handle a variety of CSV formats (I'd rather write one program that can process data from every assay at my lab than a program for each assay).
My hope was that by using worksheet.write() in the way that I did, Python would know that I want to write to the worksheet that I just created (i.e. I thought worksheet would act as the name for each worksheet during each iteration of the loop despite explicitly naming each worksheet something new).
It feels like the iteration is the problem here, and I know that it has something to do with how I'm trying to reference each Worksheet in the write() step (because I'm not giving any of the Worksheet objects an explicit name), but I don't know how to proceed. What's a good way that I might approach this problem?
I'm not sure exactly what is wrong with your code, but I can tell you this:
I copied your code exactly (except for changing the path to be my desktop) and it worked fine.
I believe your issue could be one of three things:
1. You have a buggy or old version of XlsxWriter.
2. You already have a file called Workbook.xlsx on your Desktop that is corrupted or otherwise causing problems (for example, it is open in another program).
3. You have some code other than what you posted.
To account for all of these possibilities, I would recommend that you:
1. Reinstall XlsxWriter: in a Command Prompt, run pip uninstall XlsxWriter followed by pip install XlsxWriter.
2. Change the filename of the workbook you are opening:
workbook = xlsxwriter.Workbook('C:\\Users\\Jabocus\\Desktop\\Workbook2.xlsx')
3. Try running the code that you posted exactly, then incrementally add to it until it stops working.
Did you try something like worksheet.write(0, 0, 'Hello') instead of worksheet.write('A1', 'Hello')?
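XlsxWriter's write() accepts either form: A1 notation or zero-based (row, column) indices. To show how the two relate, here is a minimal pure-Python sketch of the conversion. The function name `a1_to_rowcol` is hypothetical; XlsxWriter ships its own helper for this (xl_cell_to_rowcol in xlsxwriter.utility), which is the one to use in practice.

```python
import re


def a1_to_rowcol(cell):
    """Convert an 'A1'-style reference to zero-based (row, col) indices."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", cell)
    if match is None:
        raise ValueError(f"not an A1-style reference: {cell!r}")
    letters, digits = match.groups()
    col = 0
    for ch in letters.upper():
        # Column letters are a base-26 number with digits A=1 .. Z=26
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, col - 1


first_cell = a1_to_rowcol("A1")
```

So worksheet.write("A1", "Hello!") and worksheet.write(0, 0, "Hello!") address the same cell.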

Python Openpyxl VLOOKUP from other file

I have 2 files, let's say 't1.xlsx' and 't2.xlsx'.
What I want to do is use the VLOOKUP function inside the t1 file with the data from the t2 file.
I try to write
sheet["O2"].value = "=VLOOKUP(C:C;'C:\\Users\\KKK\\Desktop\\sheets\\excellent\\[t2.xlsx]baza'!$A$2:$AI$10480;25;0)"
where baza is a sheet name, but sadly when I try to open the file, Excel says it cannot be opened due to an error and offers me a repair tool.
rest of the code:
import openpyxl
wb = openpyxl.load_workbook('t1.xlsx')
sheets = wb.sheetnames  # replaces the deprecated get_sheet_names()
sheet = wb['Sheet1']  # replaces the deprecated get_sheet_by_name()
[VLOOKUP STUFF FROM BEFORE]
wb.save("t1.xlsx")
With more complicated formulae you should always check the syntax in the XML, because they are often stored differently than they appear in Excel. This is covered in the documentation. You might be okay simply using a comma as a separator, but I suspect you'll also have to change the path of the file and use a Python raw string (the r prefix).

OpenPyxl cannot write VLOOKUP function to excel worksheet

I'm trying to write a VLOOKUP function using OpenPyxl to a column of cells. Everything about the code works just fine, except that Excel crashes when I try to open the document after writing the functions to the cells.
I've tried writing the exact same functions wrapped in parentheses and then opening the Excel document and removing the parentheses manually, which works perfectly. Then, Excel calculates the values exactly as it should.
I'm wondering if there's a formatting error going on? Is there anything I have overlooked when trying to write functions using Openpyxl?
Basically the code I want to work:
from openpyxl import load_workbook
wb = load_workbook(path_result + '/' + 'File.xlsx')  # path_result is defined elsewhere
ws = wb['Main 2018-04-17']
ws['B2'].value = "=VLOOKUP('Main 2018-04-17'!A2;'Data 2018-04-17'!C2:E100;2;FALSE)"
wb.save(path_result + '/' + 'File.xlsx')
This is covered in the documentation: you must use a comma to separate arguments. See http://openpyxl.readthedocs.io/en/stable/usage.html#using-formulae
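As a minimal sketch of the comma rule, the formula string from the question can be built with a small helper that always joins arguments with commas. The helper name `vlookup_formula` and its parameters are hypothetical, not an openpyxl API; the sheet names and ranges are the ones from the question.

```python
def vlookup_formula(lookup, table, col_index, exact=True):
    """Build a VLOOKUP formula string using comma separators."""
    # FALSE requests an exact match, TRUE an approximate one
    match_type = "FALSE" if exact else "TRUE"
    return f"=VLOOKUP({lookup},{table},{col_index},{match_type})"


formula = vlookup_formula("'Main 2018-04-17'!A2", "'Data 2018-04-17'!C2:E100", 2)
```

The resulting string could then be assigned to a cell's value, for example ws['B2'].value = formula.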

How can I create a template .csv file that users can fill in before running my script?

I am trying to make a script that requires a user to input at least 12 different values in order to function. I thought this was somewhat impractical, so I decided to make a function that would generate a dict from a .csv file designed with two columns: variables and their respective values. The user could use a provided .csv file as a template, fill it in with all their necessary values, save it as their own .csv file, and then run it with the script.
Although this sounds simple in theory, I have found that it does not work so well in practice. Because some of the input values will be text with a lot of periods in them ("..."), they are sometimes converted into the Unicode horizontal ellipsis character (\xe2\x80\xa6 in UTF-8). Also, a UTF-8 byte order mark appears at the beginning of the first column and row (available in code as codecs.BOM_UTF8) and must be removed. In other cases, the delimiter of the .csv file was changed so that tabs were recognized as separating cells, or the contents of each row were collapsed from two cells into one.
I have no experience with the different forms of encoding or what any of them entail, but from what I have tested, it seems that opening the .csv template file in Excel or using different settings when opening your .csv file causes such problems. It's also possible that copying and pasting the values from other places brings hidden characters with them. I have been trying to fix the problems but then new problems keep springing up, and I feel like it's possible that my current approach is just wrong.
Can anybody recommend me a different, more efficient approach for allowing a user to enter in multiple inputs in one go? Or should I stick to my original approach and figure out how to keep the .csv formatting as rigorous as possible?
You can always use the csv module to abstract away most of the CSV oddities (although you will have to enforce the basic format):
import csv
import sys
def main(argv):
    if len(argv) < 2:
        print("Please provide path to your CSV template as the first argument.")
        return 1
    with open(argv[1], "r", newline="") as f:
        reader = csv.DictReader(f)
        your_vars = next(reader)  # first data row, keyed by the header row
    print(your_vars)  # prints a dictionary of all CSV vars
    return 0
if __name__ == "__main__":
    sys.exit(main(sys.argv))
NOTE: This requires the first row to hold the variables, while the second holds their values.
So all users have to do is call the script with: python your_script.py their_file.csv and in most cases it will print out a dict with the values... However, Excel is notoriously bad in handling unicode CSVs and if your users use it as their primary spreadsheet app they're likely to encounter issues. Some of that can be rectified by installing the unicodecsv module and using it as a drop-in replacement (import unicodecsv as csv) but if your users start going wild with the format eventually it will break.
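On Python 3, two of the problems from the question (the leading UTF-8 mark and the autocorrected ellipsis) can be handled without an extra module. The following is a minimal sketch, assuming the file is UTF-8 and that an ellipsis character always stood for three periods; the in-memory bytes stand in for a user's saved file.

```python
import csv
import io

# Simulated file contents: a UTF-8 BOM followed by a header row and a value row
raw = "\ufeffvar1,var2\nvalue...,a\u2026b\n".encode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present;
# newline="" lets the csv module handle line endings itself
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="") as f:
    reader = csv.DictReader(f)
    your_vars = next(reader)

# Undo the ellipsis substitution (assumes the user meant three periods)
cleaned = {key: value.replace("\u2026", "...") for key, value in your_vars.items()}
```

With a real file, the same encoding and newline arguments can be passed to open(path, "r", encoding="utf-8-sig", newline="").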
If you're looking for suggestions on formats, one of the most user-friendly formats you can use is YAML and there are several parsers available for Python - they largely work the same for the simple stuff like this but I'd recommend using the ruamel.yaml module as it's actively maintained.
Then you can create a YAML template like:
---
var1: value1
var2: value2
var3: value3
etc: add as many as you want
And your users can fill in the values in a simple text editor, then to replicate the above CSV behavior all you need is:
import yaml
import sys
def main(argv):
    if len(argv) < 2:
        print("Please provide path to your YAML template as the first argument.")
        return 1
    with open(argv[1], "r") as f:
        your_vars = yaml.safe_load(f)  # safe_load avoids constructing arbitrary objects
    print(your_vars)  # prints a dictionary of all YAML vars
    return 0
if __name__ == "__main__":
    sys.exit(main(sys.argv))
A bonus is that YAML is a plain-text format, so your users don't need fancy editors and therefore have less chance of breaking the file. Of course, while YAML is permissive, it still requires a modicum of well-formedness, so be sure to include the usual checks (does the file exist, can it be opened, can it be parsed, etc.).

Get the inputs from Excel and use those inputs in python script

How can I get inputs from Excel and use those inputs in Python?
Take a look at xlrd
This is the best reference I found for learning how to use it: http://www.dev-explorer.com/articles/excel-spreadsheets-and-python
Not sure if this is exactly what you're talking about, but:
If you have a very simple Excel file (i.e. basically just one table filled with string values, nothing fancy), and all you want to do is basic processing, then I'd suggest just converting it to a CSV (comma-separated values) file. This can be done by "saving as..." in Excel and selecting CSV.
This is just a file with the same data as the Excel file, except represented by lines of values separated with commas:
cell A:1, cell A:2, cell A:3
cell B:1, cell B:2, cell b:3
This is then very easy to parse using standard python functions (i.e., readlines to get each line of the file, then it's just a list that you can split on ",").
This is of course only helpful in some situations, like when you get a log from a program and want to quickly run a Python script which handles it.
Note: As was pointed out in the comments, splitting the string on "," is actually not very good, since you run into all sorts of problems. Better to use the csv module (which another answer here teaches how to use).
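A short illustration of why splitting on "," goes wrong: a quoted field that itself contains a comma is cut in two, while the csv module parses it as one value. The sample line is made up for the example.

```python
import csv

# One CSV line whose second field contains a comma inside quotes
line = 'widget,"Smith, John",42'

# Naive splitting breaks the quoted field apart and keeps the quote characters
naive = line.split(",")

# csv.reader accepts any iterable of lines and honors the quoting rules
parsed = next(csv.reader([line]))
```

Here naive has four pieces, while parsed has the three intended fields.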
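import win32com.client
Excel = win32com.client.Dispatch("Excel.Application")
Excel.Workbooks.Open(file_path)  # file_path: full path to the workbook
Cells = Excel.ActiveWorkbook.ActiveSheet.Cells
Cells(row, column).Value = Input  # write a value; row and column are 1-based
Output = Cells(row, column).Value  # read the value back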
If you can save as a csv file with headers:
Attrib1, Attrib2, Attrib3
value1.1, value1.2, value1.3
value2,1,...
Then I would highly recommend looking at the built-in csv module.
With that you can do things like:
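import csv
with open("csvFile.csv", "r", newline="") as f:
    # skipinitialspace drops the space after each comma in the header and values
    for row in csv.DictReader(f, skipinitialspace=True):
        print(row['Attrib1'], row['Attrib2'])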
