Python convert doc to docx

I am trying to convert a .doc file into .docx. I found this code online.
subprocess.call(['soffice', '--headless', '--convert-to', 'docx', filename])
document = docx.Document(path[:-4] + ".docx")
docText = ''.join([
    paragraph.text.encode('ascii', 'ignore') for paragraph in
    document.paragraphs
])
It works perfectly fine when I use it on my own machine, but I am trying to put it on AWS, and it doesn't work there. I get an error saying "No such file or directory".
What could be the reason that it works on my computer but not on AWS?

You must have LibreOffice installed on whichever machine runs this code. A "No such file or directory" error usually means the soffice executable could not be found there. You must also close any open instances of LibreOffice before running this, or it will exit silently without doing anything.
You can also try
unoconv -d document --format=docx *.doc
But it also depends on LibreOffice, since it converts the files through LibreOffice. It is imperfect, and some formatting is lost, but it will convert all .doc files to .docx.
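As a rough illustration of the advice above, the sketch below checks that LibreOffice is on the PATH before calling it, so a missing installation produces a clear message instead of a bare "No such file or directory". The helper names build_convert_command and convert_doc_to_docx are made up for this example, and falling back to the name libreoffice is an assumption about how the package is installed on the server.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(executable, source, outdir):
    """Build the soffice argument list for a .doc -> .docx conversion."""
    return [executable, "--headless", "--convert-to", "docx",
            "--outdir", str(outdir), str(source)]


def convert_doc_to_docx(source, outdir=None):
    """Convert one .doc file and return the path where the .docx is expected."""
    source = Path(source)
    outdir = Path(outdir) if outdir else source.parent
    # Assumption: the executable is called soffice or libreoffice on this machine.
    executable = shutil.which("soffice") or shutil.which("libreoffice")
    if executable is None:
        raise FileNotFoundError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(executable, source, outdir), check=True)
    return outdir / (source.stem + ".docx")
```

Passing --outdir explicitly keeps the output next to the source file regardless of the working directory the server process happens to use.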

Related

Converting doc to docx using python

I am trying to convert .doc documents to .docx documents using Python. Taking inspiration from this post, I have tried the following code:
import subprocess
import glob
import os
root = "//PARADFS101/7folder/LIAGREV/Documents/RFP/"
data_path = root + '/data2/'
os.chdir(data_path)
for doc in glob.iglob("*.doc"):
    print(doc)
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc], shell = True)
But unfortunately literally nothing happens: I get no error message, the code runs, and the docs are detected (which I checked with print), but I don't get any result. Any idea how I may troubleshoot this?
EDITS :
I am running on Windows, hence shell = True
I have tried double quotes : '"
I have tried without spaces in the names
When I execute the subprocess command on one file alone, I get 1 as output, which I don't know how to interpret...
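One way to troubleshoot this, offered as a sketch rather than a definitive fix: the value 1 is the exit status of the command, and by convention a non-zero status means it failed. Passing the arguments as a list without the shell and capturing stderr usually shows the reason. The helper name run_and_report is hypothetical, and the demonstration runs the Python interpreter itself so that it works without LibreOffice; swap in the soffice argument list to inspect the real failure.

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command (argument list, no shell) and return (returncode, stdout, stderr)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in command that writes a message to stderr and exits with status 1.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"]
code, out, err = run_and_report(failing)
print("exit status:", code)
print("stderr:", err)
```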

Python & MS Word: Convert .doc to .docx?

I found several questions that were similar to mine, but none of the answers came close to what I need.
Specifications: I'm working with Python 3 and do not have MS Word. My programming machine is running OS X and the cloud machine is Linux/Ubuntu.
I'm using python-docx to extract values from a .doc file that is sent to me nightly. However, python-docx only works with .docx files, so I need to convert the file to that extension first.
So, I've got a .doc file that I need to convert to .docx. This script might have to run in the cloud so I can't install any kind of Office or Office-like software. Can this be done?

Since you are working with Linux/Ubuntu, you can use LibreOffice's built-in converter.
SYNTAX
lowriter --convert-to docx *.doc
Example
lowriter --convert-to docx testdoc.doc
The first form converts all .doc files in the current folder to .docx and saves them in that same folder; the example converts a single file.
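The same batch conversion can be driven from Python. This is a sketch under a few assumptions: lowriter is on the PATH, and the helper names find_doc_files and convert_folder are invented for the example. It passes --outdir explicitly so that the output location does not depend on the current working directory.

```python
import subprocess
from pathlib import Path


def find_doc_files(folder):
    """Return the .doc files in folder, sorted by name (.docx files are not matched)."""
    return sorted(Path(folder).glob("*.doc"))


def convert_folder(folder):
    """Convert every .doc file in folder to .docx, writing results next to the sources."""
    sources = find_doc_files(folder)
    if not sources:
        return []  # nothing to convert, so LibreOffice is not started
    command = ["lowriter", "--headless", "--convert-to", "docx",
               "--outdir", str(folder)] + [str(p) for p in sources]
    subprocess.run(command, check=True)
    return [p.with_suffix(".docx") for p in sources]
```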

You could use unoconv (Universal Office Converter), which converts between any document formats supported by LibreOffice/OpenOffice.
unoconv -d document --format=docx *.doc
subprocess.call(['unoconv', '-d', 'document', '--format=docx', filename])

Aspose.Words Cloud SDK for Python can convert DOC to DOCX. The package can open, generate, edit, split, merge, compare and convert a Word document in Python on any platform without depending on MS Word.
It is a paid product, but the free plan provides 150 free monthly API calls.
P.S. I'm a developer evangelist at Aspose.
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Get your credentials from https://dashboard.aspose.cloud (free registration is required).
words_api = asposewordscloud.WordsApi(app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx',app_key='xxxxxxxxxxxxxxxxxxxxxxxxx')
words_api.api_client.configuration.host = 'https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.doc'
dest_name = 'C:/Temp/02_pages.docx'
# Convert DOC to DOCX
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='docx')
result = words_api.convert_document(request)
copyfile(result, dest_name)

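import os
import aspose.words as aw

path1 = "doc file path"                # folder containing the .doc file
path2 = "path to save converted file"  # folder for the converted .docx
file = "example.doc"                   # hypothetical file name; the original snippet left `file` undefined
file2 = file.rsplit('.', 1)[0] + '.docx'
filename1 = os.path.join(path2, file2)
filename = os.path.join(path1, file)
doc = aw.Document(filename)
doc.save(filename1)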

First you will need to be using Windows. If that is an acceptable barrier then please read on....
Next you need to install the Microsoft Office Compatibility Pack.
Now download and install the Microsoft Office Migration Planning Manager.
To run the tool you need to create a .ini file that controls the program. An example .ini file and further information is available on this blog post.
There is more detailed information from Microsoft here.

Issues in copying files on iOS using NSStreams

I am trying to copy image and media files using NSStreams. I can not use NSFileManager copyItemAtPath, as I have to copy the file using streams.
The data is transferred over the network and the stream is read by a Python script that writes the data to a file. This worked fine on Mac OS X, but when I tried it on iOS, the file was not saved in the proper format.
I am able to copy all the files, but some of the metadata like dimensions (for image and media files), and duration (for media files) is missing in the copied file, and the kind is always Document. The other metadata is fine.
When I try to read the file attributes using the NSFileManager
[[NSFileManager defaultManager] attributesOfItemAtPath:@"filePath" error:&error];
It shows an error in the console:
The operation couldn't be completed. No such file or directory
I also observed that all the copied files, irrespective of the file extension (.png, .jpeg, .mov, .zip), have a kind of Document.
How do I copy the source image metadata into the copied file?
Are there any Xcode optimizations I need to turn off?
OS : Mac OSX 10.8.4, iOS 6
Xcode : 4.6.3

This works for me for any type of file:
if (![NSFileManager.defaultManager copyItemAtPath: sourceFileName toPath: targetFileName error: &error]) {
    NSAlert *alert = [NSAlert alertWithError: error];
    [alert runModal];
    return;
}

I found out it is an issue with the file extension. Some junk characters were appended after the file extension (something like 1.png\\\).
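If the stray characters arrive with the file name sent over the stream, one option is to clean the name in the receiving Python script before writing the file. This is only a sketch: the helper name clean_filename is hypothetical, and it assumes the junk consists of trailing backslashes, whitespace, or NUL bytes, as in the 1.png\\\ example.

```python
import os


def clean_filename(raw_name):
    """Strip trailing backslashes, whitespace and NUL bytes from a received file name."""
    return raw_name.rstrip("\\ \t\r\n\x00")


name = clean_filename("1.png\\\\\\")  # "1.png" followed by three backslashes
print(name)                           # 1.png
print(os.path.splitext(name)[1])      # .png
```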

Why does Python not recognize filesize of CSV immediately after unzipping?

I have a python (v2.7.2 on OSX Lion) script that unzips an archive into a new folder, and then finds a csv within those files. It then attempts to open the CSV and read through it.
This was all working as expected, up to a point. The problem I have been running into is that, when executed as described above, at times the script perceives the file to be 0 length. But when I run the same code from the command line interpreter, it sees the file correctly. Can anyone help me understand what the reason for this might be?
Pseudo Code:
# unzip the archive, locating csvfile along the way...
statinfo = os.stat(unzip_dir + "/" + csvfile)
print statinfo
output from the above snippet:
posix.stat_result(st_mode=33188, st_ino=5318966, st_dev=234881026L, st_nlink=1, st_uid=0, st_gid=80, st_size=0, st_atime=1329963124, st_mtime=1329963124, st_ctime=1329963124)
(notice st_size=0!)
Now I go directly to the python command line and enter:
import os
statinfo = os.stat("/Users/Me/Testdir/test.csv")
print statinfo
Output from the above snippet:
posix.stat_result(st_mode=33188, st_ino=5318966, st_dev=234881026L, st_nlink=1, st_uid=0, st_gid=80, st_size=290, st_atime=1329963124, st_mtime=1329963124, st_ctime=1329963124)
As we see, st_size is now seen by Python.
I'm stumped. Any ideas? I can post more code if necessary. Thank you.

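You probably just forgot to flush. Until the file you wrote during unzipping is flushed or closed, its data may still be in a buffer, so os.stat reports a size of 0.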
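A small sketch of what a missing flush looks like. It is written for Python 3 (the question uses Python 2.7, where buffering behaves similarly), and the file name and the 290-byte payload are only illustrative. Data written to a file object sits in a buffer until flush() or close() is called, so os.stat can report a size of 0 in the meantime.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.csv")

f = open(path, "wb")
f.write(b"x" * 290)                   # small write stays in Python's buffer
size_before = os.stat(path).st_size   # nothing has reached the file yet
f.flush()                             # push the buffered bytes to the OS
size_after = os.stat(path).st_size
f.close()

print(size_before, size_after)
```

With the default buffer size this typically prints 0 and then 290. Using a with block, or closing the file before calling os.stat, avoids the problem.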

Creating pdfs in Python with Pisa / xhtml2pdf

I know there are a lot of questions based on pdf creation in Python but I haven't seen anything based on creating pdfs with Pisa or xhtml2pdf.
Here is my code.
pisa.pisaDocument(cStringIO.StringIO(a).encode('utf-8'),file('mypdf.pdf','wb'))
and then
pisa.startViewer('mypdf.pdf')
I assembled this over a couple different tutorials and examples but every single thing that I've tried always results in the pdf being corrupted and I get this message when trying to open the pdf.
"Adobe Reader could not open 'awesomer.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."
This message occurs even when I don't use the .encode('utf-8') on the string.
What am I doing wrong? Does the encoding on my Mac have to do with this?

I'd suggest closing the file manually; I had a similar problem. Try this:
f = file('mypdf.pdf', 'wb')
pisa.pisaDocument(cStringIO.StringIO(a.encode('utf-8')), f)
f.close()

I recommend doing the following:
pdf = pisa.pisaDocument(cStringIO.StringIO(a.encode('utf-8')), file('mypdf.pdf', 'wb'))
if pdf.err:
    print "*** %d ERRORS OCCURRED" % pdf.err
And then see what the error output is.
I'm not sure what string you are encoding, but this might also help:
pdf = pisa.pisaDocument(cStringIO.StringIO(html.encode(a).encode('utf-8')), file('mypdf.pdf', 'wb'))
It depends on whether a needs to be HTML-encoded.
