When attempting to open my xml files, I am prompted with three choices:
"Please select how you would like to open this file:
As an XML table
As a read-only workbook
Use the XML Source task pane."
Option 1 basically opens the xml file in a table in an excel worksheet, which is beautiful and easy to work with. More importantly, I have done all of my data cleaning in R on this format.
Now, I would like to automate the process of simply opening the xml file, selecting open "as an XML table" and then saving as an xlsx file.
I am very close to doing this using win32com, but the issue is that the file seems to be saved in the "As a read-only workbook" mode which means all of the column headers are messed up.
My Question: I want to convert an XML file into a table like that feature in excel does it and then immediately save it into xlsx format (without even opening the xml file) using preferably win32com.
My current code:
excel = win32com.client.DispatchEx('Excel.Application')
# excel.visible = True
xml1 = excel.Workbooks.Open(os.getcwd() + "\\" + "1.xml")
xml1.SaveAs(os.getcwd() +"\\" + "1.xlsx")
xml1.Close()
I do not want to open every xml file, it is not very time efficient.
Related
I am trying to make a python program that downloads and XLS file from a website, in this case website is: https://www.blackrock.com/uk/individual/products/291392/, and loads it as a dataframe in pandas, with the correct data structure.
The issue is that when I try to load it via pandas, it gives me an error: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf\xef\xbb\xbf<?'
I am not quite sure what is causing this error, but presumable something with the file. I can open the file in Excel, even though I get a warning that the file and the file extension do not match, and that the file might be dangerous etc. If I click yes to opening it anyway, it opens up with data displayed correctly. If I use Excel to save the file as .xlsx i can open it in pandas, but I would rather a solution that didn't require manually opening Excel and saving the file.
I have tried renaming the file extension to xlsx, but this does not work, as it won't allow me to open the file with that extension.
I have tried many different extension, but non of them bite - unfortunately.
I am at a loss.
I hope, you can help.
EDIT: The code I use is:
download_path = 'https://www.blackrock.com/uk/individual/products/291392/fund/1527484370694.ajax?fileType=xls&fileName=iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund&dataType=fund'
testing = pd.read_excel(download_path, engine='xlrd', sheet_name = 'Holdings', skiprows = 3)
The actual problem is that the file format is SpreadSheetML which has only been used briefly between 2003 and 2006. It has been overtaken by the XLSX format. Since, it has been around for a short time and while ago, most packages do not support for load/save operations. More about the format can be found here: https://learn.microsoft.com/en-us/previous-versions/office/developer/office-xp/aa140066(v=office.10)?redirectedfrom=MSDN
For this reason, the Pandas or any other XML parser (e.g Etree) will not be able to load properly. The regular MS Office software would still load it correctly. As far as I know, you can deal with SpreadSheetML files using aspose-cells package: https://products.aspose.com/cells/python-java/
For your case:
# Import packages
import jpype
import asposecells
jpype.startJVM()
from asposecells.api import Workbook, FileFormatType
from asposecells.api import HtmlSaveOptions
# Read Workbook
workbook = Workbook('iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.xls')
worksheet = workbook.getWorksheets().get(0)
# Accessing a cell using its name
cells = worksheet.getCells()
cell = cells.get("A1")
# Print Message
print("Cell Value: " + str(cell.getValue())) # Prints Cell Value: 17-Nov-2021
# To save SpreadSheetML in different format (HTML)
saveOptions = HtmlSaveOptions()
saveOptions.setDisableDownlevelRevealedComments(True)
workbook.save("iShares-MSCI-World-SRI-UCITS-ETF-USD-Dist_fund.html", saveOptions)
As mentioned by Slybot, this is not a real xls file.
If you inspect the contents in a plain text editor, or a hex editor, the header starts:
<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
which confirms this is an xml document, and not an Office 2007 zipped xlsx office document.
Your next steps depend on whether you have Excel installed on the machine that will be running this code or not, and if not, what other libraries you have access to and are willing to pay for - Slybot has mentioned aspose for example.
The easiest solution - Excel
If you are running this on a Windows machine with Excel installed, you have the free and capable option of automating the operation of opening Excel and saving as xlsx. This is by using Win32com module, described in this answer:
Attempting to Parse an XLS (XML) File Using Python
Alternatively, save your Excel styled XML as xlsx with Workbook.SaveAs method using win32com (only for Windows users) and read in with pandas.read_excel skipping appropriate rows.
The XML solution
You could read in the raw XML and digest it. The relevant nodes are:
<ss:Workbook>
<ss:Worksheet ss:Name="Holdings">
<ss:Table>
<ss:Row>
<ss:Cell ss:StyleID="Left">
<ss:Data ss:Type="String">iShares MSCI World SRI UCITS ETF</ss:Data>
The Third-party library solution
I am not familiar with any libraries which provide this functionality, and can't advise on this option.
I created a Python script which scrapes a website, grabs the information I need and writes it in an Excel file. The problem is that, sometimes, after I check what has been written, I forget to close the Excel file and the script no longer runs (due to the fact that the file is already opened)
To open/write/save/close the Excel file, I am using the openpyxl library, but I don't know how to check if that specific file is opened and how to close it afterwards.
#writing to excel
pathexcel = r'C:\Users\...\Data.xlsx'
wb = load_workbook(pathexcel)
sheet = wb.active
sheet.append(row1)
sheet.append(row2)
sheet.append(row3)
wb.save(pathexcel)
Thank you!
I have 5 Excel files that have to be compiled into one csv file that can be uploaded to our website for our affiliated stores database. Until now we've had someone manually cut and paste the rows of each file into one master csv file in Excel then they upload that file to the website.
I've been trying to use Python to consolidate the files so the user would just have to run the Python script that would do this for her. The problem is that the Excel files are encoded in Shift-JIS and when I use CSV writer in Python they get converted to UTF-8. The website we upload them to will only accept files in Shift-JIS, so I have to keep all of this data in Shift-JIS.
Since DOS automatically defaults to ascii encoding, I first have to run this:
import codecs, sys, xlrd, csv
reload(sys)
sys.setdefaultencoding('shift_jis')
Here is a sample of the code for one of the Excel files, which has data on 2 separate worksheets:
with xlrd.open_workbook('Circle.xls') as wb:
for sheet in wb.sheets():
fn = 'store-'
print "Converting files.."
with open(fn + sheet.name + ".csv","wb") as f:
c = csv.writer(f,dialect="excel")
for r in range(sheet.nrows):
c.writerow(sheet.row_values(r))
The conversion runs until it finds a UTF-8 character that doesn't exist in shift-JIS, then it errors out.
Is there a way to convert from Excel to a csv purely in shift-JIS?
(If my question has a flaw, please ask me to edit it before marking it down! I will edit it!)
I have many Python scripts that output CSV files. It is occasionally convenient to open these files in Excel. After installing OS X Mavericks, Excel no longer opens these files properly: Excel doesn't parse the files and it duplicates the rows of the file until it runs out of memory. Specifically, when Excel attempts to open the file, a prompt appears that reads: "File not loaded completely."
Example of code I'm using to generate the CSV files:
import csv
with open('csv_test.csv', 'wb') as f:
writer = csv.writer(f)
writer.writerow([1,2,3])
writer.writerow([4,5,6])
Even the simple file generated by the above code fails to load in Excel. However, if I open the CSV file in a text editor and copy/paste the text into Excel, parse it with text to columns, and then save as CSV from Excel, then I can reopen the CSV file in Excel without issue. Do I need to pass an additional parameter in my scripts to make Excel parse the CSV files the same way it used to? Or is there some setting I can change in OS X Mavericks or Excel? Thanks.
Maybe I had the similar problem, the error message "SYLK: File format is not valid" when open python autogenerated csv file. The solution is really funny. The first two characters must not be I and D in uppercase (ID). Also see "SYLK: File format is not valid" error message when you open file.
Possible solution1: use *.txt instead of *.csv. In this case Excel (at least, 2010) will show you an import data wizard where you can specify delimiters, character encoding, field types, etc.
UPD: Solution2:
The python "csv" module has a "dialect" feature. For example, the following modification of your code generates valid csv file for my environment (Python 2.7, Excel 2010, Windows7, locale with ";" list delimiters):
import csv
with open('csv_test2.csv', 'wb') as f:
csv.excel.delimiter=';'
writer = csv.writer(f, dialect=csv.excel)
writer.writerow([1,2,3])
writer.writerow([4,5,6])
Is there a simple way to save a .dbf file as a .xls file using python. I just want to open the .dbf file in python and then immediately do a 'save as .xls'. I'm trying to avoid looping through records in a dbf and writing to the excel table.
Something like:
fileobj = open(outFolder + "\\" + "TABLE.dbf")
fileobj.save(outFolder + "\\" + "TABLE.xls")
I've searched around the internet, but haven't found anything, so I thought I'd try a post.
Thanks,
Mike
If you were willing to accept using the Python csv module (which would involve looping through dbf records and writing csv), consider going straight to xls by using xlwt.
Alternatively, could your client could be shown how to open dbf files in Excel?