Python Openpyxl VLOOKUP from other file

I have two files, say 't1.xlsx' and 't2.xlsx'.
I want to use the VLOOKUP function inside t1 with data from t2.
I tried to set
sheet["O2"].value = "=VLOOKUP(C:C;'C:\\Users\\KKK\\Desktop\\sheets\\excellent\\[t2.xlsx]baza'!$A$2:$AI$10480;25;0)"
where baza is a sheet name, but when I try to open the file, Excel says it cannot be opened because of an error and offers to repair it.
The rest of the code:
import openpyxl
wb = openpyxl.load_workbook('t1.xlsx')
sheets = wb.get_sheet_names()
sheet = wb.get_sheet_by_name('Sheet1')
[VLOOKUP STUFF FROM BEFORE]
wb.save("t1.xlsx")

With more complicated formulae you should always check the syntax in the XML, because they are often stored differently from how they appear in Excel. This is covered in the documentation. You might be okay simply using a comma as the separator, but I suspect you'll also have to change the path of the file and use a Python raw string (the r prefix).
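A minimal sketch of the comma and raw-string advice, assuming the lookup key sits in C2 (the question passed the whole column C:C) and that t2.xlsx is at the path from the question. The helper names external_vlookup and write_lookup are made up for illustration. Whether Excel accepts a full path in a formula written this way is not confirmed here, so it is still worth checking the saved XML.

```python
from pathlib import PureWindowsPath


def external_vlookup(lookup_cell, book_path, sheet_name, table_range, col_index):
    """Build a VLOOKUP formula that points at a range in another workbook."""
    book = PureWindowsPath(book_path)
    # External-reference form as typed in Excel: 'folder\[file.xlsx]sheet'!range
    ref = f"'{book.parent}\\[{book.name}]{sheet_name}'!{table_range}"
    # Arguments are separated by commas, not by the semicolons a localized
    # Excel UI may display.
    return f"=VLOOKUP({lookup_cell},{ref},{col_index},FALSE)"


def write_lookup(target_path, formula, cell="O2", sheet_name="Sheet1"):
    """Store the formula in one cell of an existing workbook (needs openpyxl)."""
    import openpyxl

    wb = openpyxl.load_workbook(target_path)
    wb[sheet_name][cell] = formula
    wb.save(target_path)


# The raw string keeps the backslashes in the Windows path literal.
formula = external_vlookup(
    "C2",  # assumption: one lookup key per row
    r"C:\Users\KKK\Desktop\sheets\excellent\t2.xlsx",
    "baza",
    "$A$2:$AI$10480",
    25,
)
print(formula)
```

Calling write_lookup("t1.xlsx", formula) would then store the string in O2 of Sheet1; openpyxl does not evaluate it, Excel does that when the file is opened.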

Related

Python: Unsupported format, or corrupt file

I am trying to write a Python program that downloads an XLS file from a website, in this case https://www.blackrock.com/uk/individual/products/291392/, and loads it as a DataFrame in pandas with the correct data structure.
The issue is that when I try to load it via pandas, I get an error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'
I am not sure what is causing this error, but presumably it is something in the file. I can open the file in Excel, although I get a warning that the file format and the file extension do not match and that the file might be dangerous. If I click Yes to open it anyway, the data is displayed correctly. If I use Excel to save the file as .xlsx, I can open it in pandas, but I would rather have a solution that doesn't require manually opening Excel and saving the file.
I have tried renaming the file extension to xlsx, but this does not work: the file will not open with that extension.
I have tried many different extensions, but none of them work, unfortunately.
I am at a loss.
I hope you can help.
EDIT: The code I use is:
import pandas as pd
download_path = 'https://www.blackrock.com/uk/individual/products/291392/fund/1527484370694.ajax?fileType=xls&fileName=iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund&dataType=fund'
testing = pd.read_excel(download_path, engine='xlrd', sheet_name='Holdings', skiprows=3)

The actual problem is that the file is in the SpreadSheetML format, which was only used briefly, between 2003 and 2006, before being superseded by XLSX. Because it was around for a short time, and a while ago, most packages do not support loading or saving it. More about the format can be found here: https://learn.microsoft.com/en-us/previous-versions/office/developer/office-xp/aa140066(v=office.10)?redirectedfrom=MSDN
For this reason, pandas will not load it properly, and a generic XML parser (e.g. ElementTree) will not give you a table without extra work. Microsoft Office itself still loads it correctly. As far as I know, you can handle SpreadSheetML files with the aspose-cells package: https://products.aspose.com/cells/python-java/
For your case:
# Import packages
import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook, FileFormatType
from asposecells.api import HtmlSaveOptions
# Read Workbook
workbook = Workbook('iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls')
worksheet = workbook.getWorksheets().get(0)
# Accessing a cell using its name
cells = worksheet.getCells()
cell = cells.get("A1")
# Print Message
print("Cell Value: " + str(cell.getValue())) # Prints Cell Value: 17-Nov-2021
# To save SpreadSheetML in different format (HTML)
saveOptions = HtmlSaveOptions()
saveOptions.setDisableDownlevelRevealedComments(True)
workbook.save("iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.html", saveOptions)

As mentioned by Slybot, this is not a real xls file.
If you inspect the contents in a plain text editor, or a hex editor, the header starts:
<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
which confirms this is an xml document, and not an Office 2007 zipped xlsx office document.
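The same check can be done in code instead of a hex editor. This is only a sketch: the function name sniff_spreadsheet is made up, and it looks at nothing but the first few bytes (the OLE2 signature of a binary .xls, the zip signature of an .xlsx, or the start of an XML document), skipping any leading UTF-8 byte order marks like the two shown in the error message.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # binary .xls container
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a zip archive
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet(header):
    """Guess the container format from the first bytes of a file."""
    # Skip any number of leading byte order marks.
    while header.startswith(UTF8_BOM):
        header = header[len(UTF8_BOM):]
    if header.startswith(OLE2_MAGIC):
        return "xls"
    if header.startswith(ZIP_MAGIC):
        return "xlsx"
    if header.lstrip().startswith(b"<"):
        return "xml"
    return "unknown"


# The bytes reported in the XLRDError from the question, plus the XML header.
print(sniff_spreadsheet(b'\xef\xbb\xbf\xef\xbb\xbf<?xml version="1.0"?>'))
```

For a file on disk, passing the result of open(path, 'rb').read(64) is enough, since only the first bytes matter.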
Your next steps depend on whether you have Excel installed on the machine that will be running this code or not, and if not, what other libraries you have access to and are willing to pay for - Slybot has mentioned aspose for example.
The easiest solution - Excel
If you are running this on a Windows machine with Excel installed, you have the free and capable option of automating Excel to open the file and save it as xlsx. This uses the win32com module, as described in this answer:
Attempting to Parse an XLS (XML) File Using Python
Alternatively, save your Excel styled XML as xlsx with Workbook.SaveAs method using win32com (only for Windows users) and read in with pandas.read_excel skipping appropriate rows.
The XML solution
You could read in the raw XML and digest it. The relevant nodes are:
<ss:Workbook>
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data>
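A rough sketch of digesting those nodes with the standard library, under a few assumptions: the function name read_sheet is made up, the sample document is a cut-down stand-in for the real download, and cells that carry an ss:Index attribute (skipped columns) are not handled. The real file may also need more cleanup than the byte order marks stripped here.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}
BOM = b"\xef\xbb\xbf"


def read_sheet(raw, sheet_name):
    """Return the rows of one worksheet as lists of cell strings."""
    # The error in the question shows two UTF-8 BOMs before "<?xml";
    # strip them so the parser sees the XML declaration first.
    while raw.startswith(BOM):
        raw = raw[len(BOM):]
    root = ET.fromstring(raw)
    rows = []
    for sheet in root.findall("ss:Worksheet", NS):
        # Namespaced attributes are keyed as "{namespace}Name".
        if sheet.get(f"{{{SS}}}Name") != sheet_name:
            continue
        for row in sheet.findall("ss:Table/ss:Row", NS):
            # Cells without an ss:Data child become empty strings.
            rows.append([cell.findtext("ss:Data", default="", namespaces=NS)
                         for cell in row.findall("ss:Cell", NS)])
    return rows


# Cut-down sample built from the nodes listed above.
sample = BOM + BOM + b"""<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data>
</ss:Cell>
</ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>"""

print(read_sheet(sample, "Holdings"))
```

The resulting list of rows can be handed to pandas.DataFrame, dropping the leading rows in the same way skiprows would.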
The Third-party library solution
I am not familiar with any libraries which provide this functionality, and can't advise on this option.

OpenPyxl cannot write VLOOKUP function to excel worksheet

I'm trying to write a VLOOKUP function to a column of cells using OpenPyxl. Everything about the code works just fine, except that Excel crashes when I try to open the document after writing the functions to the cells.
I've tried writing the exact same functions wrapped in parentheses, then opening the Excel document and removing the parentheses manually, which works perfectly: Excel then calculates the values exactly as it should.
I'm wondering if there's a formatting error going on. Is there anything I have overlooked when trying to write functions using OpenPyxl?
Basically the code I want to work:
wb = load_workbook(path_result + '/' + 'File.xlsx')
ws = wb['Main 2018-04-17']
ws['B2'].value = "=VLOOKUP('Main 2018-04-17'!A2;'Data 2018-04-17'!C2:E100;2;FALSE)"
wb.save(path_result + '/' + 'File.xlsx')

This is covered in the documentation: you must use a comma to separate arguments. See http://openpyxl.readthedocs.io/en/stable/usage.html#using-formulae
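As a sketch of the comma-separated form applied to a whole column: the helper names lookup_formula and fill_column are made up, and the sheet names and range are copied from the question. The dry run below uses a plain dict, because an openpyxl worksheet accepts the same ws["B2"] = ... assignment.

```python
def lookup_formula(row, key_sheet="Main 2018-04-17", data_sheet="Data 2018-04-17"):
    """VLOOKUP for one row, with comma-separated arguments."""
    return (f"=VLOOKUP('{key_sheet}'!A{row},"
            f"'{data_sheet}'!C2:E100,2,FALSE)")


def fill_column(ws, first_row, last_row, column="B"):
    """Assign one formula per row; ws can be an openpyxl worksheet."""
    for row in range(first_row, last_row + 1):
        ws[f"{column}{row}"] = lookup_formula(row)


# Dry run against a dict standing in for the worksheet.
preview = {}
fill_column(preview, 2, 4)
print(preview["B2"])
```

With a real workbook, fill_column(wb['Main 2018-04-17'], 2, 100) followed by wb.save(...) would write the same strings into column B.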

Reading extracted XLSX files with OpenPyXL

So I've been using Python 3.2, and OpenPyXL's iterable workbook as demonstrated here in the "Optimized Reader" example.
My problem arises when I try to use this strategy to read a file or files that I've extracted from a simple .zip archive (both manually and through the python zipfile package). When I call .get_highest_column() I get "A" and .get_highest_row() I get 1, and when asked to print each cell's value as shown here:
wb = load_workbook(filename = file_name, use_iterators = True)
ws = wb.worksheets[0] # Only need to read the first sheet, nothing fancy
for row in ws.iter_rows():
    for entry in row:
        print(entry.internal_value)
It prints the values in A1, A2, A3, A4, A5, A6, and A7, regardless of how large the file actually is. There isn't any reason for this in the file itself, and it will open in Excel perfectly fine. I'm quite stumped as to why it does it like this, but I assume that the unzipped XLSX is formatted differently prior to being saved from within Excel, and OpenPyXL cannot interpret it correctly. I even renamed the '.xlsx' to '.zip' so that I could explore the file and examine the differences, but couldn't tell much except that the one saved from Excel also has a subfolder called "theme" within the "xl" folder that the previous version does not, with font and formatting data.
IMPORTANT NOTE: When I open it and re-save it with the same filename from within Excel and then run this bit of code, it works perfectly - returns correct greatest row and column values, and correctly prints every cell value. I've tried instead saving the workbook through OpenPyXL immediately after opening it, but this yields the same erroneous results.
Basically, I need to discover a method to properly extract a .xlsx file from a .zip file so that it can be read with OpenPyXL. There are many many files that need to be processed like this, so it must be external to Excel, and hopefully as efficient as possible.
Cheers!

It sounds like this has nothing to do with the extraction from the zipfile, as the problem also occurs if you manually extract the files.
I would try to store the files opened and saved with Excel in a zipfile and see what happens. If that works, then clearly the way the original .xlsx files were generated is the problem.
I strongly suspect that to be the case.
If that is the problem, see if you can extract the .xlsx files (they are zipfiles themselves) and compare the one you re-saved with Excel to the original problematic one. xml does not compare easily as Excel can rearrange most things at will, but you might be able to do a diff.
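A first pass at that comparison can be done without unpacking anything by hand. The sketch below only reports which archive members are missing or differ byte for byte; it does not diff the XML itself. The names compare_archives and make_archive are made up, and the two in-memory archives are invented stand-ins for the original and the re-saved file.

```python
import io
import zipfile


def compare_archives(original, resaved):
    """Report which archive members differ between two .xlsx files."""
    with zipfile.ZipFile(original) as a, zipfile.ZipFile(resaved) as b:
        names_a, names_b = set(a.namelist()), set(b.namelist())
        changed = sorted(n for n in names_a & names_b if a.read(n) != b.read(n))
    return {
        "only_in_original": sorted(names_a - names_b),
        "only_in_resaved": sorted(names_b - names_a),
        "changed": changed,
    }


def make_archive(members):
    """Build an in-memory zip from a {name: bytes} dict (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf


# Invented contents; a real run would pass the two .xlsx paths instead.
original = make_archive({"xl/workbook.xml": b"<workbook/>",
                         "xl/worksheets/sheet1.xml": b"<worksheet v='1'/>"})
resaved = make_archive({"xl/workbook.xml": b"<workbook/>",
                        "xl/worksheets/sheet1.xml": b"<worksheet v='2'/>",
                        "xl/theme/theme1.xml": b"<theme/>"})
report = compare_archives(original, resaved)
print(report)
```

Members listed under "changed" are the ones worth opening and diffing as XML, with the worksheet part being the likely place to look for the row and column dimensions.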

How to rename files and change the file type as well?

I have a list of .dbf files that I want to convert to .csv files. By hand, I open them in Excel and re-save them as .csv, but this takes too much time.
I have now written a script that changes the file name, but when I open the file it is still a .dbf file (although it is named .csv). How can I rename the files in such a way that the file type also changes?
My script uses the following (the dbf and csv file names are listed in a separate csv file):
IN = dbffile name
OUT = csvfile name
for output_line in lstRename:
    shutil.copyfile(IN,OUT)

Changing the name of a file (and the extension is just part of the complete name) has absolutely no effect on the contents of the file. You need to somehow convert the contents from one format to the other.
Using my dbf module and python it is quite simple:
import dbf
IN = 'some_file.dbf'
OUT = 'new_name.csv'
dbf.Table(IN).export(filename=OUT)
This will create a .csv file that is actually in csv format.

If you have ever used VB or looked into VBA, you can write a simple Excel script that opens each file and saves it as csv under a new name.
Use the macro recorder to record yourself doing it once and then edit the resulting script.
I have now created an application that automates this. It's called xlsto (look for the xls.exe release file). It allows you to pick a folder and convert all xls files to csv (or any other type).

You need a converter
Search for dbf2csv in google.

It depends on what you want to do. It seems like you want to convert files to other types. There are many converters out there, but a computer alone doesn't know every file type, so you will need to download some software for that. If all you want to do is change the file extension (e.g. .png, .doc, .wav), then you can set your computer to let you change both the name and the extension. I hope I helped in some way :)

Download the dbfpy library from http://sourceforge.net/projects/dbfpy/?source=dlp
# Python 2 code (raw_input and the print statement)
import csv,glob
from dbfpy import dbf
entrada = raw_input("Enter the folder containing your .dbf files: ")
lisDbf = glob.glob(entrada + "\\*dbf")
for db in lisDbf:
    print db
    try:
        # DBF is a binary format, so open it in binary mode
        dbfFile = dbf.Dbf(open(db,'rb'))
        csvFile = csv.writer(open(db[:-3] + "csv", 'wb'))
        headers = range(len(dbfFile.fieldNames))
        allRows = []
        for row in dbfFile:
            rows = []
            for num in headers:
                rows.append(row[num])
            allRows.append(rows)
        # header line first, then one csv line per record
        csvFile.writerow(dbfFile.fieldNames)
        for row in allRows:
            print row
            csvFile.writerow(row)
    except Exception,e:
        print e

It might be that the new file name is "xzy.csv.dbf". Usually in C# I put quotes in the filename. This forces the OS to change the filename. Try something like "Xzy.csv" in quotes.

How do I extract specific lines of data from a huge Excel sheet using Python?

I need to get specific lines of data that have certain key words in them (names) and write them to another file. The starting file is a 1.5 GB Excel file. I can't just open it up and save it as a different format. How should I handle this using python?

I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]
1. How big is the file in MB? ["Huge" is not a useful answer]
2. What software created the file?
3. How much memory do you have on your computer?
4. Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".
5. Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message.
6. What operating system, what version of Python, what version of xlrd?
7. Do you know how many worksheets there are in the file?

It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.
Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.
If the workbook is just bigger in dimension then xlrd should work for reading the file, but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here since you can open the file with EditPad lite) then you would have to find an alternate method because xlrd reads the entire workbook into memory.
Assuming the first case:
import xlrd
wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'
wb = xlrd.open_workbook(wb_path)
ws = wb.sheets()[0] # assuming you want to work with the first sheet in the workbook
with open(output_path, 'w') as output_file:
    for i in range(ws.nrows):
        row = [cell.value for cell in ws.row(i)]
        # ... replace the following if statement with your own conditions ...
        if row[0] == u'interesting':
            # cell values can be numbers, so convert each one to text before joining
            output_file.write('\t'.join(str(value) for value in row) + '\n')
This will give you a tab-delimited output file that should open in Excel.
Edit:
Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.

I haven't used it, but xlrd looks like it does a good job reading Excel data.

Your problem is that you are using Excel 2003. You need a more recent version to be able to read this file: Excel 2003 will not open worksheets with more than 65,536 rows.
