I have been given an xlsb file full of data. I want to process the data using Python. I can convert it to csv using Excel or OpenOffice, but I would like the whole process to be more automated. Any ideas?
Update: I took a look at this question and used the first answer:
import subprocess
subprocess.call("cscript XlsToCsv.vbs data.xlsb data.csv", shell=False)
The issue is that the file contains Greek letters and the encoding is not preserved. Opened in Notepad++, the csv looks as it should, but when I try to insert it into a database it comes out like this: ���. When I open the csv file and just read the text, it is displayed like this:
\xc2\xc5\xcb instead of ΒΕΛ.
I realize it's an encoding issue, but is it possible to retain the original encoding when converting the xlsb file to csv?
I've encountered this same problem, and pyxlsb does it for me:
from pyxlsb import open_workbook
with open_workbook('HugeDataFile.xlsb') as wb:
    for sheetname in wb.sheets:
        with wb.get_sheet(sheetname) as sheet:
            for row in sheet.rows():
                # cell values may be numbers or None, so convert them before joining
                values = ['' if r.v is None else str(r.v) for r in row]  # retrieving content
                csv_line = ','.join(values)  # or do your thing
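A plain ','.join also breaks as soon as a cell itself contains a comma. Here is a minimal sketch that writes the rows with the standard csv module instead. The helper names rows_to_csv and xlsb_sheet_to_csv are made up for illustration, and the only thing assumed about pyxlsb is what the snippet above already uses (open_workbook, get_sheet, rows() and cells with a .v attribute):

```python
import csv


def rows_to_csv(rows, out_path, encoding="utf-8"):
    """Write rows (sequences of cells exposing a .v attribute) to a CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(out_path, "w", newline="", encoding=encoding) as fh:
        writer = csv.writer(fh)
        for row in rows:
            # None marks an empty cell; write it as an empty field
            writer.writerow(["" if cell.v is None else cell.v for cell in row])


def xlsb_sheet_to_csv(xlsb_path, sheet_name, out_path):
    """Convert one sheet of an .xlsb workbook (needs the third-party pyxlsb package)."""
    from pyxlsb import open_workbook  # imported lazily so rows_to_csv works without it
    with open_workbook(xlsb_path) as wb:
        with wb.get_sheet(sheet_name) as sheet:
            rows_to_csv(sheet.rows(), out_path)
```

Writing the file as UTF-8 from Python also sidesteps the code-page problem from the question, since the text never passes through Excel's CSV export.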
The most popular Excel packages for Python, openpyxl and xlrd, have no support for the xlsb format (bug tracker entries: openpyxl, xlrd).
So I'm afraid there is no native Python way =/. However, since you are using Windows, it should be easy to script the task with external tools.
I would suggest taking a look at Convert XLS to XLSB Programatically?. You mention Python in the title, but the body of the question does not imply you are strongly tied to it, so you could go the pure C# way.
If you only feel comfortable with Python, one of the answers there suggests a command line tool under the fancy name of Convert-XLSB. You could script it as an external tool from Python with subprocess.
I know this is not a good answer, but I don't think there is better/easier way as of now.
In my previous experience, I handled xlsb conversion with the LibreOffice command line utility.
In Ruby I just execute a system command that calls LibreOffice to convert the xlsb file to csv:
`libreoffice --headless --convert-to csv your_xlsb_file.xlsb --outdir /path/csv`
To change the encoding I call iconv from the command line, again from Ruby:
`iconv -f ISO-8859-1 -t UTF-8 your_csv_file.csv > new_file_csv.csv`
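The iconv step can be done in Python as well. This is a sketch only: the function name reencode is made up, and the default source encoding cp1253 (Windows Greek) is a guess based on the bytes \xc2\xc5\xcb from the question, which decode to ΒΕΛ under cp1253. With ISO-8859-1, as in the iconv command above, the same bytes would come out as Latin letters instead.

```python
def reencode(src_path, dst_path, src_encoding="cp1253", dst_encoding="utf-8"):
    """Copy a text file, converting it from src_encoding to dst_encoding."""
    # newline="" on both sides keeps the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

Usage would be something like reencode("your_csv_file.csv", "new_file_csv.csv"). If the result still looks wrong, the source code page is probably different and has to be passed explicitly.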
I also looked at the problem and the following worked for me: first open the file in Excel via Python, then save it to a different file. It's a bit of a workaround, but I like it more than the other solutions. The example uses file format 6, which is CSV, but you can also use other ones.
import win32com.client
excel = win32com.client.Dispatch("Excel.Application")
excel.DisplayAlerts = False
excel.Visible=False
doc = excel.Workbooks.Open("C:/users/A295998/Python/#TA1PROG3.xlsb")
doc.SaveAs(Filename="C:\\users\\A295998\\Python\\test5.csv",FileFormat=6)
doc.Close()
excel.Quit()
XLSB is a binary format and I don't think you'll be able to parse it with current Python tools and packages. If you still want to automate the process with Python, you can do what the others have suggested and script that Windows CLI tool: call the .exe from the command line with subprocess, passing it the files you want to convert.
E.g., with a script similar to this one you could convert all the .xlsb files that you place in the "xlsb" folder to .csv format:
├── xlsb
│ ├── file1.xlsb
│ ├── file2.xlsb
│ └── file3.xlsb
└── xlsb_to_csv.py
xlsb_to_csv.py
#!/usr/bin/env python
import os
import subprocess
# collect the .xlsb files placed in the "xlsb" folder
files = [f for f in os.listdir('./xlsb') if f.lower().endswith('.xlsb')]
for f in files:
    # ConvertXLS.EXE and --arguments are placeholders for the real tool and its options
    subprocess.call(['ConvertXLS.EXE', os.path.join('./xlsb', f), '--arguments'])
Note: the Windows command is pseudocode. I use a similar approach to batch-convert stuff on headless Windows servers for testing purposes. You just have to figure out the exe location and the Windows command.
Hope it helps... good luck!
I think you can do this using pyuno. This blog entry shows how to convert xls files to csv, and as OpenOffice has supported xlsb files since version 3.2, this code might just work for you. You will have to go through the hassle of setting up the pyuno environment, though.
The script you reference seems to use the ActiveX interface to Excel and saves via its Workbook.SaveAs method.
According to the MSDN documentation, this method has a TextCodepage argument, which may be helpful.
Side note: you can rewrite the VB script in Python; see this question.
I am trying to make a Python program that creates and writes to a txt file.
The program works, but I want it to tick the "Hidden" option in the txt file's properties, so that the file can't be seen without using the Python program I made. I have no clue how to do that; please understand that I am a beginner in Python.
I'm not 100% sure but I don't think you can do this in Python. I'd suggest finding a simple Visual Basic script and running it from your Python file.
Assuming you mean the "hidden" attribute in the file properties (as in the Windows file properties dialog), there are a few options.
Use the operating system's command line from Python
For example, on the Windows command line, attrib +h Secret_File.txt hides a file. From Python:
import subprocess
subprocess.run(["attrib", "+h", "Secret_File.txt"])
See also:
How to execute a program or call a system command?
Directly call OS functions (Windows)
import ctypes
path = "my_hidden_file.txt"
ctypes.windll.kernel32.SetFileAttributesW(path, 2)
See also:
Hide Folders/ File with Python
Rename the file (Linux)
import os
filename = "my_hidden_file.txt"
os.rename(filename, '.'+filename) # the prefix dot means hidden in Linux
See also:
How to rename a file using Python
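The Windows and Linux variants above can be folded into one helper. This is a sketch under a few assumptions: the name hide_file is made up, the Windows branch reads the current attributes first so that existing flags are kept (the snippet above overwrites them with 2), and the non-Windows branch renames only the last path component, so it also works for paths that include a directory.

```python
import ctypes
import os
import sys

FILE_ATTRIBUTE_HIDDEN = 0x02  # Windows "hidden" attribute flag


def hide_file(path):
    """Hide a file and return its (possibly changed) path."""
    if sys.platform == "win32":
        kernel32 = ctypes.windll.kernel32
        attrs = kernel32.GetFileAttributesW(path)
        if attrs == -1:  # INVALID_FILE_ATTRIBUTES
            raise OSError("cannot read attributes of %r" % path)
        # keep the existing attributes and add the hidden flag
        if not kernel32.SetFileAttributesW(path, attrs | FILE_ATTRIBUTE_HIDDEN):
            raise OSError("cannot set attributes of %r" % path)
        return path
    directory, name = os.path.split(path)
    if name.startswith("."):
        return path  # already hidden by convention
    hidden = os.path.join(directory, "." + name)
    os.rename(path, hidden)
    return hidden
```

Keep in mind that a hidden file is only hidden from default listings. Anyone who enables "show hidden files" (or runs ls -a) can still see and open it.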
I found several questions that were similar to mine, but none of the answers came close to what I need.
Specifications: I'm working with Python 3 and do not have MS Word. My development machine runs OS X and the cloud machine runs Linux (Ubuntu).
I'm using python-docx to extract values from a .doc file that is sent to me nightly. However, python-docx only works with .docx files, so I need to convert the file to that extension first.
So, I've got a .doc file that I need to convert to .docx. This script might have to run in the cloud so I can't install any kind of Office or Office-like software. Can this be done?
Since you are working with Linux/Ubuntu, you can use LibreOffice's built-in converter.
SYNTAX
lowriter --convert-to docx *.doc
Example
lowriter --convert-to docx testdoc.doc
The wildcard form converts every .doc file in the current folder to .docx and saves the results in the same folder.
You could use unoconv (Universal Office Converter), which converts between any document formats supported by LibreOffice/OpenOffice.
unoconv -d document --format=docx *.doc
Or, from Python:
import subprocess
subprocess.call(['unoconv', '-d', 'document', '--format=docx', filename])
Aspose.Words Cloud SDK for Python can convert DOC to DOCX. The package can open, generate, edit, split, merge, compare and convert a Word document in Python on any platform without depending on MS Word.
It is a paid product, but the free plan provides 150 free monthly API calls.
P.S: I'm developer evangelist at Aspose.
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Get your credentials from https://dashboard.aspose.cloud (free registration is required).
words_api = asposewordscloud.WordsApi(app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx',app_key='xxxxxxxxxxxxxxxxxxxxxxxxx')
words_api.api_client.configuration.host = 'https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.doc'
dest_name = 'C:/Temp/02_pages.docx'
# Convert DOC to DOCX
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='docx')
result = words_api.convert_document(request)
copyfile(result, dest_name)
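import os
import aspose.words as aw
path1 = "doc file path"                # folder containing the .doc files
path2 = "path to save converted file"  # folder for the converted .docx files
# `file` was not defined in the original snippet; looping over path1 is an assumption
for file in os.listdir(path1):
    if not file.lower().endswith('.doc'):
        continue
    file2 = file.rsplit('.', 1)[0] + '.docx'
    filename1 = os.path.join(path2, file2)
    filename = os.path.join(path1, file)
    doc = aw.Document(filename)
    doc.save(filename1)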
First you will need to be using Windows. If that is an acceptable barrier then please read on....
Next you need to install the Microsoft Office Compatibility Pack.
Now download and install the Microsoft Office Migration Planning Manager.
To run the tool you need to create a .ini file that controls the program. An example .ini file and further information is available on this blog post.
There is more detailed information from Microsoft here.
For my work, I scrape web-sites and write them to gzipped web-archives (with extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.
I noticed that I cannot read the majority of these files completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.
After some research I found out that this problem happens only with the gzipped files. The files with extension "warc" do not have this problem.
I have found that other people have this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution has been found.
I guess that there might be a bug in "gzip" in Python 2.7.11. Does anyone have experience with this and know what can be done about it?
Thanks in advance!
Example:
I create new warc.gz files like this:
import warc
warc_path = r"\\some_path\file_name.warc.gz"  # raw string, so "\f" is not read as a form feed
warc_file = warc.open(warc_path, "wb")
To write records I use:
record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)
This creates perfect "warc.gz" files. There are no problems with them. All, including "\r\n" is correct. But the problem starts when I read these files.
To read files I use:
warc_file = warc.open(warc_path, "rb")
To loop through records I use:
for record in warc_file:
    ...
The problem is that not all records are found during this looping for "warc.gz" file, while they all are found for "warc" files. Working with both types of files is addressed in the warc-library itself.
It seems that the custom gzip handling in warc.gzip2.GzipFile, file splitting with warc.utils.FilePart and reading in warc.warc.WARCReader is broken as a whole (tested with python 2.7.9, 2.7.10 and 2.7.11). It stops short when it receives no data instead of a new header.
It would seem that basic stdlib gzip handles the catenated files just fine and so this should work as well:
import gzip
import warc
with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()
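The claim that the stdlib gzip module copes with catenated gzip data can be checked without the warc library. A small self-contained sketch (Python 3; the function names are made up): each chunk is compressed as its own gzip member, the members are joined back to back, and gzip.open reads straight across the member boundaries.

```python
import gzip
import io


def concat_gzip_members(chunks):
    """Compress each chunk as its own gzip member and join the members."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def read_all(data):
    """Decompress a byte string that may hold several gzip members."""
    with gzip.open(io.BytesIO(data), "rb") as gzf:
        return gzf.read()
```

If read_all returns every chunk, the data itself is fine and the truncation comes from the reader on top of it, which matches the diagnosis above.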
I've been scouring the net to find a Python library or tool that can convert an Excel file to/from ODS format, but haven't been able to come across anything.
I need the ability to input and output data in either format. We don't need to worry about merged cells, formulas or anything non-straightforward.
If you have libreoffice installed, you can do a python execution wrapper around its headless mode:
$ /usr/bin/libreoffice --headless --invisible -convert-to ods /home/cwgem/Downloads/QTL_Sample_data.xls
convert /home/cwgem/Downloads/QTL_Sample_data.xls -> /home/cwgem/QTL_Sample_data.ods using OpenDocument Spreadsheet Flat XML
$ /usr/bin/libreoffice --headless --invisible -convert-to xls /home/cwgem/QTL_Sample_data.ods
convert /home/cwgem/QTL_Sample_data.ods -> /home/cwgem/QTL_Sample_data.xls using
Which would be a bit easier than trying to do it through the library route.
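A minimal sketch of such a wrapper, assuming a libreoffice binary on the PATH. The function names are made up, and the flags are the ones used in the commands quoted in this thread (--headless, --convert-to, --outdir). The returned path is only what LibreOffice is expected to write for a plain extension such as "ods" or "xls"; it is not checked.

```python
import os
import subprocess


def build_convert_command(src, fmt, outdir=None, binary="libreoffice"):
    """Build the argument list for a headless LibreOffice conversion."""
    cmd = [binary, "--headless", "--convert-to", fmt]
    if outdir:
        cmd += ["--outdir", outdir]
    cmd.append(src)
    return cmd


def convert(src, fmt, outdir=None, binary="libreoffice"):
    """Run the conversion and return the expected output path."""
    outdir = outdir or os.getcwd()
    # check=True raises CalledProcessError if LibreOffice exits with an error
    subprocess.run(build_convert_command(src, fmt, outdir, binary), check=True)
    base = os.path.splitext(os.path.basename(src))[0]
    return os.path.join(outdir, base + "." + fmt)
```

For example, convert("QTL_Sample_data.xls", "ods") for one direction and convert("QTL_Sample_data.ods", "xls") for the other.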
I managed to convert an xlsx file to an ods file with this method:
install pyexcel-xlsx
install pyexcel-ods3
Then use the following code:
from pyexcel_ods3 import save_data
from pyexcel_xlsx import get_data
dataXlsx = get_data("file.xlsx")
save_data("file.ods", dataXlsx)
Attention: the colors/design of the xlsx file are removed in the output ods file, so it is not a complete success.
I've got a bunch of FoxPro (VFP9) DBF files on my Ubuntu system, is there a library to open these in Python? I only need to read them, and would preferably have access to the memo fields too.
Update: Thanks @cnu, I used Yusdi Santoso's dbf.py and it works nicely. One gotcha: the memo file name extension must be lower case, i.e. .fpt, not .FPT, which was how the filename came over from Windows.
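For anyone who hits the same gotcha with a whole folder of files, the renaming can be scripted. A small sketch using only the standard library; the function name lowercase_extension is made up, and it touches only files whose extension matches ext when case is ignored:

```python
import os


def lowercase_extension(directory, ext=".fpt"):
    """Rename files such as TABLE.FPT to TABLE.fpt; return the new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        base, found = os.path.splitext(name)
        # match the extension case-insensitively, skip files already lower case
        if found.lower() == ext and found != ext:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, base + ext))
            renamed.append(base + ext)
    return renamed
```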
I prefer dbfpy. It supports both reading and writing of .DBF files and can cope with most variations of the format. It's the only implementation I have found that could both read and write the legacy DBF files of some older systems I have worked with.
I was able to read a DBF file (with associated BAK, CDX, FBT, TBK files**) using the dbf package from PyPI http://pypi.python.org/pypi/dbf . I am new to python and know nothing about DBF files, but it worked easily to read a DBF file from my girlfriend's business (created with a music store POS application called AIMsi).
After installing the dbf package (I used aptitude and installed dbf version 0.88 I think), the following python code worked:
from dbf import *
test = Table("testfile.dbf")
for record in test:
    print record
    x = raw_input("")  # to pause between showing records
That's all I know for now, but hopefully it's a useful start for someone else who finds this question!
April 21, 2012 SJK Edit: Per Ethan Furman's comment, I should point out that I actually don't know which of the data files were necessary, besides the DBF file. The first time I ran the script, with only the DBF available, it complained of a missing support file. So, I just copied over the BAK, CDX, FPT (not FBT as I said before edit), TBK files and then it worked.
If you're still checking this, I have a GPL FoxPro-to-PostgreSQL converter at https://github.com/kstrauser/pgdbf . We use it to routinely copy our tables into PostgreSQL for fast reporting.
You can try this recipe on Active State.
There is also a DBFReader module which you can try.
For support for memo fields.
Check out http://groups.google.com/group/python-dbase
It currently supports dBase III and Visual FoxPro 6.0 db files. I'm not sure whether the file layout changed in VFP 9 or not.
It's 2016 now and I had to fiddle with the dbf package to get it to work. Here is a Python 3 version that just exports a dbf file to csv:
import dbf
d=dbf.Table('mydbf.dbf')
d.open()
dbf.export(d, filename='mydf_exported.csv', format='csv', header=True)
I had some unicode error at first, but got around that by turning off memos.
import dbf
d=dbf.Table('mydbf.dbf', ignore_memos=True)
d.open()
dbf.export(d, filename='mydf_exported.csv', format='csv', header=True)