I found several questions that were similar to mine, but none of the answers came close to what I need.
Specifications: I'm working with Python 3 and do not have MS Word. My programming machine runs OS X; the cloud machine runs Linux/Ubuntu.
I'm using python-docx to extract values from a .doc file that is sent to me nightly. However, python-docx only works with .docx files, so I need to convert the file to that extension first.
So, I've got a .doc file that I need to convert to .docx. This script might have to run in the cloud so I can't install any kind of Office or Office-like software. Can this be done?
Since you are working on Linux/Ubuntu, you can use LibreOffice's built-in converter.
Syntax:
lowriter --convert-to docx *.doc
Example:
lowriter --convert-to docx testdoc.doc
This will convert all .doc files in the current folder to .docx and save them in the same folder.
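If this has to run unattended from your script, a minimal sketch (assuming lowriter is on the PATH and filename holds the path of the nightly .doc file) would be:
import subprocess

# Convert a single .doc to .docx; lowriter ships with LibreOffice,
# so no MS Office is needed on the cloud machine.
subprocess.call(['lowriter', '--convert-to', 'docx', filename])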
You could use unoconv (Universal Office Converter), which converts between any document formats supported by LibreOffice/OpenOffice.
unoconv -d document --format=docx *.doc
subprocess.call(['unoconv', '-d', 'document', '--format=docx', filename])
Aspose.Words Cloud SDK for Python can convert DOC to DOCX. The package can open, generate, edit, split, merge, compare and convert a Word document in Python on any platform without depending on MS Word.
It is a paid product, but the free plan provides 150 free monthly API calls.
P.S.: I'm a developer evangelist at Aspose.
# Import modules
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile

# Get your credentials from https://dashboard.aspose.cloud (free registration is required).
words_api = asposewordscloud.WordsApi(app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx', app_key='xxxxxxxxxxxxxxxxxxxxxxxxx')
words_api.api_client.configuration.host = 'https://api.aspose.cloud'

filename = 'C:/Temp/02_pages.doc'
dest_name = 'C:/Temp/02_pages.docx'

# Convert DOC to DOCX
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='docx')
result = words_api.convert_document(request)
copyfile(result, dest_name)
The Aspose.Words for Python package can also do the conversion locally:
import os
import aspose.words as aw

path1 = "doc file path"                 # folder containing the source .doc
path2 = "path to save converted file"   # output folder
file = "example.doc"                    # placeholder: name of the .doc file to convert

file2 = file.rsplit('.', 1)[0] + '.docx'
filename = os.path.join(path1, file)
filename1 = os.path.join(path2, file2)

doc = aw.Document(filename)
doc.save(filename1)
First, you will need to be using Windows. If that is an acceptable constraint, then please read on...
Next you need to install the Microsoft Office Compatibility Pack.
Now download and install the Microsoft Office Migration Planning Manager.
To run the tool you need to create a .ini file that controls the program. An example .ini file and further information are available in this blog post.
There is more detailed information from Microsoft here.
👉 The original file is in .docx format and contains multiple tables, but it apparently has format problems, so it cannot be read by python-docx.
✔️ 1. Solution by hand:
The problem can be fixed by clicking the [Save As...] menu in Word; a prompt box appears offering to upgrade the document to the newest format.
❓ 2. Question:
How can I implement this [Save As] function through Python, upgrading the .docx format to the latest version?
😃 Thanks for any suggestions!
3. Appendix
from docx import Document          # used afterwards to read the upgraded file
from win32com import client as wc

file = 'D:\\1.docx'
word = wc.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(file)
doc.SaveAs(file, 12)               # 12 = wdFormatXMLDocument (current .docx format)
doc.Close()
word.Quit()
As a compromise, we first create a blank DOCX, then use the Win32 libraries to copy the entire content into that blank document. This has been tested and works, but I'm still looking forward to a cleaner method; a minimal sketch follows.
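For reference, a minimal sketch of that compromise, assuming Word is installed and using hypothetical paths:
from docx import Document
from win32com import client as wc

src = 'D:\\broken.docx'   # hypothetical: the file python-docx cannot read
dst = 'D:\\blank.docx'    # a fresh, well-formed .docx

Document().save(dst)      # create the blank DOCX with python-docx

word = wc.Dispatch("Word.Application")
word.Visible = False
source = word.Documents.Open(src)
source.Content.Copy()     # copy the whole body to the clipboard
target = word.Documents.Open(dst)
target.Content.Paste()    # paste it into the blank document
target.Save()
source.Close()
target.Close()
word.Quit()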
I am trying to convert a doc file into docx. I found this code online.
import subprocess
import docx

subprocess.call(['soffice', '--headless', '--convert-to', 'docx', filename])
document = docx.Document(path[:-4] + ".docx")
docText = ''.join([
    paragraph.text.encode('ascii', 'ignore').decode('ascii')  # bytes back to str on Python 3
    for paragraph in document.paragraphs
])
It works perfectly fine when I use it on my own machine, but when I try to run it on AWS it doesn't work. I get an error saying "No such file or directory".
What could be the reason that it works on my computer but not on AWS?
You must have LibreOffice installed on whatever machine runs this code, and you must close any open instances of LibreOffice before running it, or it will exit silently without doing anything.
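A quick preflight check from Python can make the failure obvious; this is a sketch assuming the same filename variable as above:
import shutil
import subprocess

# Fail fast if LibreOffice isn't on the PATH -- the usual cause of
# "No such file or directory" on a fresh cloud instance.
if shutil.which('soffice') is None:
    raise RuntimeError("LibreOffice (soffice) is not installed or not on PATH")

subprocess.call(['soffice', '--headless', '--convert-to', 'docx', filename])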
You can also try
unoconv -d document --format=docx *.doc
But it also depends on LibreOffice, since it performs the conversion through LibreOffice. It is imperfect and some formatting is lost, but it will convert all .doc files to .docx.
I have a PDF file, and I want to replace some text in it and generate a new PDF. How can I do that in Python?
I have tried reportlab, but reportlab does not have any function to search for text and replace it. What other module can I use?
You can try the Aspose.PDF Cloud SDK for Python. Aspose.PDF Cloud is a REST API PDF processing solution. It is a paid API, but its free plan provides 50 credits per month.
I'm a developer evangelist at Aspose.
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
app_sid='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
copied_file= '02_pages_new.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name, filename)
#copy the PDF file on storage before editing
pdf_api.copy_file(remote_name, copied_file)
#Replace Text
text_replace = asposepdfcloud.models.TextReplace(old_value='origami',new_value='polygami',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace])
response = pdf_api.post_document_text_replace(copied_file, text_replace_list)
print(response)
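If you need the result locally, you can then fetch the modified copy from storage; this is a sketch, and download_file is assumed here from the same SDK's storage API:
from shutil import copyfile

# Download the edited PDF from cloud storage back to disk (assumed API).
local_path = pdf_api.download_file(copied_file)
copyfile(local_path, '02_pages_new_local.pdf')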
Have a look in THIS thread for one of the many ways to read text from a PDF. Then you'll need to create a new PDF, as those approaches will, as far as I know, not preserve any formatting for you.
The CAM::PDF Perl library can output text that's not too hard to parse (it seems to split lines of text fairly randomly). I couldn't be bothered to learn too much Perl, so I wrote these really basic Perl command-line scripts: one that reads a single-page PDF to a text file (perl read.pl pdfIn.pdf textOut.txt) and one that writes the text (which you can modify in the meantime) back to a PDF (perl write.pl pdfIn.pdf textIn.txt pdfOut.pdf).
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textOut = $ARGV[1];
$pdf = CAM::PDF->new($pdfIn);
$page = $pdf->getPageContent(1);
open(my $fh, '>', $textOut);
print $fh $page;
close $fh;
exit;
and
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textIn = $ARGV[1];
$pdfOut = $ARGV[2];
$pdf = CAM::PDF->new($pdfIn);
my $page;
open(my $fh, '<', $textIn) or die "cannot open file $textIn";
{
local $/;
$page = <$fh>;
}
close($fh);
$pdf->setPageContent(1, $page);
$pdf->cleanoutput($pdfOut);
exit;
You can call these from Python on either side of doing some regex work on the outputted text file, as in the sketch below.
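For example, a sketch of the Python side, with hypothetical file names and a simple regex substitution in between:
import re
import subprocess

# Dump page 1 of the PDF to an editable text file via the Perl script.
subprocess.check_call(['perl', 'read.pl', 'pdfIn.pdf', 'textOut.txt'])

with open('textOut.txt') as fh:
    content = fh.read()

# Do your regex work on the page stream, e.g. a simple word swap.
content = re.sub(r'origami', 'polygami', content)

with open('textIn.txt', 'w') as fh:
    fh.write(content)

# Write the edited text back into a new PDF.
subprocess.check_call(['perl', 'write.pl', 'pdfIn.pdf', 'textIn.txt', 'pdfOut.pdf'])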
If you're completely new to Perl (like I was), you need to make sure that Perl and CPAN are installed, then run sudo cpan, then in the prompt run install "CAM::PDF";. This will install the required modules.
Also, I realise that I should probably be using stdout etc., but I was in a hurry :-)
Also, any ideas what the format CAM::PDF outputs is? Is there any documentation for it?
I have been provided with an xlsb file full of data. I want to process the data using Python. I can convert it to CSV using Excel or OpenOffice, but I would like the whole process to be more automated. Any ideas?
Update: I took a look at this question and used the first answer:
import subprocess
subprocess.call("cscript XlsToCsv.vbs data.xlsb data.csv", shell=False)
The issue is that the file contains Greek letters, so the encoding is not preserved. Opening the CSV with Notepad++ it looks as it should, but when I try to insert it into a database it comes out like this: ���. Opening the file as CSV, the text is displayed like this:
\xc2\xc5\xcb instead of ΒΕΛ.
I realize it's an encoding issue, but is it possible to retain the original encoding when converting the xlsb file to CSV?
I've encountered this same problem and using pyxlsb does it for me:
from pyxlsb import open_workbook

with open_workbook('HugeDataFile.xlsb') as wb:
    for sheetname in wb.sheets:
        with wb.get_sheet(sheetname) as sheet:
            for row in sheet.rows():
                values = [r.v for r in row]  # retrieving cell contents
                csv_line = ','.join(str(v) for v in values)  # cells may be numbers or None, so stringify
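To keep the original characters intact (the Greek-text problem above), one option is to write the rows with the csv module and an explicit UTF-8 encoding; a sketch under that assumption:
import csv
from pyxlsb import open_workbook

# Export every sheet to a single UTF-8 CSV so non-ASCII text survives.
with open_workbook('HugeDataFile.xlsb') as wb, \
        open('HugeDataFile.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    for sheetname in wb.sheets:
        with wb.get_sheet(sheetname) as sheet:
            for row in sheet.rows():
                writer.writerow([r.v for r in row])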
Most popular Excel python packages openpyxl and xlrd have no support for xlsb format (bug tracker entries: openpyxl, xlrd).
So I'm afraid there is no native Python way =/. However, since you are using Windows, it should be easy to script the task with external tools.
I would suggest taking a look at Convert XLS to XLSB Programatically?. You mention Python in the title, but the body of the question does not imply you are strongly coupled to it, so you could go the pure C# way.
If you feel really comfortable only with Python, one of the answers there suggests a command-line tool under the fancy name of Convert-XLSB. You could script it as an external tool from Python with subprocess.
I know this is not a good answer, but I don't think there is better/easier way as of now.
In my previous experience, I handled converting xlsb files with the LibreOffice command-line utility.
In Ruby, I just execute a system command calling LibreOffice to convert the xlsb file to CSV:
`libreoffice --headless --convert-to csv your_xlsb_file.xlsb --outdir /path/csv`
and to change the encoding I use iconv from the command line, again via Ruby:
`iconv -f ISO-8859-1 -t UTF-8 your_csv_file.csv > new_file_csv.csv`
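Since this thread is about Python, the same two steps can be scripted with subprocess; a sketch with the paths kept hypothetical:
import subprocess

# Step 1: convert the workbook to CSV with headless LibreOffice.
subprocess.call(['libreoffice', '--headless', '--convert-to', 'csv',
                 'your_xlsb_file.xlsb', '--outdir', '/path/csv'])

# Step 2: re-encode the CSV from ISO-8859-1 to UTF-8, as iconv did.
with open('/path/csv/your_xlsb_file.csv', encoding='iso-8859-1') as src:
    data = src.read()
with open('/path/csv/new_file_csv.csv', 'w', encoding='utf-8') as dst:
    dst.write(data)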
I also looked at the problem, and the following worked for me: first opening the file in Excel via Python and then saving it to a different file. It's a bit of a workaround, but I like it more than other solutions. In the example I use file format 6, which is CSV, but you can also use other ones.
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.DisplayAlerts = False   # suppress the overwrite-confirmation prompt
excel.Visible = False
doc = excel.Workbooks.Open("C:/users/A295998/Python/#TA1PROG3.xlsb")
doc.SaveAs(Filename="C:\\users\\A295998\\Python\\test5.csv", FileFormat=6)  # 6 = xlCSV
doc.Close()
excel.Quit()
XLSB is a binary format, and I don't think you'll be able to parse it with current Python tools and packages. If you still want to somehow automate the process with Python, you can do what the others have suggested and script that Windows CLI tool: call the .exe from the command line with subprocess, passing the files you want to convert.
E.g.: with a script similar to this one, you could convert all the .xlsb files that you place in the "xlsb" folder to .csv format...
├── xlsb
│ ├── file1.xlsb
│ ├── file2.xlsb
│ └── file3.xlsb
└── xlsb_to_csv.py
xlsb_to_csv.py
#!/usr/bin/env python
import os
import subprocess

files = [f for f in os.listdir('./xlsb')]
for f in files:
    # pseudo-command: substitute the real converter's path and flags
    subprocess.call("ConvertXLS.EXE " + os.path.join('xlsb', f) + " --arguments", shell=True)
Note: the Windows command is pseudocode... I use a similar approach to batch-convert stuff on headless Windows servers for testing purposes. You just have to figure out the .exe location and the Windows command...
Hope it helps... good luck!
I think you can do this using pyuno. This blog entry shows how to convert xls files to CSV, and as OpenOffice supports xlsb files since version 3.2, this code might just work for you. You will have to go through the hassle of setting up the pyuno environment, though.
The script you reference seems to use the ActiveX interface to Excel and save via its Workbook.SaveAs method.
According to the MSDN documentation, this method has a TextCodepage argument which may be helpful.
Sidenote: you can rewrite the VB script in Python, see this question.
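A sketch of that idea in Python via win32com, assuming Excel is installed and with a hypothetical path; whether Excel honours TextCodepage varies by version, so treat this as an experiment:
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(r"C:\data\data.xlsb")   # hypothetical path
# 6 = xlCSV; 65001 = UTF-8 code page
wb.SaveAs(r"C:\data\data.csv", FileFormat=6, TextCodepage=65001)
wb.Close(False)
excel.Quit()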
I've got a bunch of FoxPro (VFP9) DBF files on my Ubuntu system. Is there a library to open these in Python? I only need to read them, and would prefer to have access to the memo fields too.
Update: Thanks @cnu, I used Yusdi Santoso's dbf.py and it works nicely. One gotcha: the memo file name extension must be lower case, i.e. .fpt, not .FPT, which was how the filename came over from Windows.
I prefer dbfpy. It supports both reading and writing of .DBF files and can cope with most variations of the format. It's the only implementation I have found that could both read and write the legacy DBF files of some older systems I have worked with.
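A minimal read sketch with dbfpy (a Python 2 era package; the field name here is hypothetical):
from dbfpy import dbf

db = dbf.Dbf("testfile.dbf")   # open an existing table
for rec in db:
    print(rec["NAME"])         # access a field by name (hypothetical field)
db.close()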
I was able to read a DBF file (with associated BAK, CDX, FBT, TBK files) using the dbf package from PyPI http://pypi.python.org/pypi/dbf . I am new to Python and know nothing about DBF files, but it worked easily to read a DBF file from my girlfriend's business (created with a music store POS application called AIMsi).
After installing the dbf package (I used aptitude and installed dbf version 0.88 I think), the following python code worked:
from dbf import *

test = Table("testfile.dbf")
for record in test:
    print record        # Python 2 syntax
    x = raw_input("")   # to pause between showing records
That's all I know for now, but hopefully it's a useful start for someone else who finds this question!
April 21, 2012 SJK Edit: Per Ethan Furman's comment, I should point out that I actually don't know which of the data files were necessary, besides the DBF file. The first time I ran the script, with only the DBF available, it complained of a missing support file. So, I just copied over the BAK, CDX, FPT (not FBT as I said before edit), TBK files and then it worked.
If you're still checking this, I have a GPL FoxPro-to-PostgreSQL converter at https://github.com/kstrauser/pgdbf . We use it to routinely copy our tables into PostgreSQL for fast reporting.
You can try this recipe on ActiveState.
There is also a DBFReader module which you can try.
For support for memo fields, check out http://groups.google.com/group/python-dbase
It currently supports dBase III and Visual FoxPro 6.0 db files... not sure if the file layout changed in VFP 9 or not...
It's 2016 now, and I had to fiddle with the dbf package to get it to work. Here is a Python 3 version to just export a dbf file to a CSV:
import dbf

d = dbf.Table('mydbf.dbf')
d.open()
dbf.export(d, filename='mydf_exported.csv', format='csv', header=True)
I had some Unicode errors at first, but got around those by turning off memos.
import dbf

d = dbf.Table('mydbf.dbf', ignore_memos=True)
d.open()
dbf.export(d, filename='mydf_exported.csv', format='csv', header=True)