Encoding problems using popen

Encoding problems using popen - python

I got a InDesign page filled with 4 pictures.
Im using
ex = os.popen('exiftool -T -IngredientsFilePath '+item).read()
to get back all pictures and their filepath used in that document
But when i try to write this answer into a file it seems the encoding went went wrong,
or the answer from that command is already in any other encoding.
All ü,Ü,ä,Ä,ö,Ö,',',ß or spaces look like this : u%cc%88, o%cc%88, %c3%9f and so on.
Does anyone know how to fix this? At the moment im using .replace('%c3%9f', 'ß') for ex
to get the right encoding but im not srsly lucky with that.
Greetings Yierith

Related

Wrong character encoding of rtf file

When I copy and paste the sentence How brave they’ll all think me at home! into a blank TextEdit rtf document on the Mac, it looks fine. But if I create an an apparently identical rtf file programatically, and write the same sentence into it, on opening TextEdit it appears as How brave theyâ€™ll all think me at home! In the following code, output is OK, but the file when viewed in TextEdit has problems with the right single quotation mark (here used as an apostrophe), unicode U-2019.
header = r"""{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf400
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\paperw11900\paperh16840\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 """
sen = 'How brave they’ll all think me at home!'
with open('staging.rtf', 'w+’) as f:
f.write(header)
f.write(sen)
f.write('}')
with open('staging.rtf') as f:
output = f.read()
print(output)
I’ve discovered from https://www.i18nqa.com/debug/utf8-debug.html that this may be caused by “UTF-8 bytes being interpreted as Windows-1252”, and that makes sense as it seems that ansicpg1252 in the header indicates US Windows.
But I still can’t work out how to fix it, even having read the similar issue here: Encoding of rtf file. I’ve tried replacing ansi with mac without effect. And adding ,encoding='utf8' to the open function doesn’t seem to help either.
(The reason for using rtf by the way is to be able to export sentences with colour-coded words, allow them to be manually edited, then read back in for further processing).

OK, I've found the answer myself. I needed to use , encoding='windows-1252' both when writing to the rtf file and also when reading from it.

Trying to use json load with txt file

import json
f=open("99_jiayi.txt",'r',encoding = 'utf-8-sig')
a=json.load(f)
f.close()
https://i.stack.imgur.com/vD3M5.png
https://i.stack.imgur.com/QP0oW.png
I tried to turn the text file into a list in order to analyze the data
I used the "json load" and it worked on another file which is written in the same mode
But when i want to use it on another file it comes out the error
i searched google for a lot of time but still cant get the ans
Hope someone can help me with this question
i have some problem to express my thought with eng so if anyone cant understand what i am typing plz let me know tks!!

The error message points to "1":NR, which does not look like valid JSON, so it seems like a valid error.
Edit: try putting all NR within quotes.

Proper format for a file name in python

I'm importing an mp3 file using IPython (more specifically, the IPython.display.display(IPython.display.Audio() command) and I wanted to know if there was a specific way you were supposed to format the file path.
The documentation says it takes the file path so I assumed (perhaps incorrectly) that it should be something like \home\downloads\randomfile.mp3 which I used an online converter to convert into unicode. I put that in (using, of course, filename=u'unicode here' but that didn't work, instead giving a bunch of errors. I tried reformatting it in different ways (just \downloads\randomfile.mp3, etc) but none of them worked. For those who are curious, here is the unicode characters: \u005c\u0044\u006f\u0077\u006e\u006c\u006f\u0061\u0064\u0073\u005c\u0062\u0064\u0061\u0079\u0069\u006e\u0073\u0074\u0072\u0075\u006d\u0065\u006e\u0074\u002e\u006d\u0070\u0033 which translates to \home\Downloads\bdayinstrument.mp3, I believe.
So, am I doing something wrong? What is the correct way to format the "file path"?
Thanks!

Creating pdfs in Python with Pisa / xhtml2pdf

I know there are a lot of questions based on pdf creation in Python but I haven't seen anything based on creating pdfs with Pisa or xhtml2pdf.
Here is my code.
pisa.pisaDocument(cStringIO.StringIO(a).encode('utf-8'),file('mypdf.pdf','wb'))
and then
pisa.startViewer('mypdf.pdf')
I assembled this over a couple different tutorials and examples but every single thing that I've tried always results in the pdf being corrupted and I get this message when trying to open the pdf.
"Adobe Reader could not open 'awesomer.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."
This message occurs even when I don't use the .encode('utf-8') on the string.
What am I doing wrong? Does the encoding on my Mac have to do with this?

I'd suggest closing the file manually, had a simmilar problem. Try this:
f = file('mypdf.pdf', 'wb')
pisa.pisaDocument(cStringIO.StringIO(a).encode('utf-8'),f)
f.close()

I recommend doing the following:
pdf = pisa.pisaDocument(cStringIO.StringIO(a).encode('utf-8'),file('mypdf.pdf','wb'))
if pdf.err:
print "*** %d ERRORS OCCURED" % pdf.err
And then see what the error output is.
I'm not sure what string you are encoding but this might also help:
pdf = pisa.pisaDocument(cStringIO.StringIO(html.encode(a)).encode('utf-8'),file('mypdf.pdf','wb'))
It depends on if a needs to be html encoded

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:
http://www.4shared.com/document/kmJF67E4/forms.html
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader
input = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
print page.extractText()

Note that extractText() still has problems extracting the text properly. From the documentation for extractText():
This works well for some PDF files,
but poorly for others, depending on
the generator used. This will be
refined in the future. Do not rely on
the order of text coming out of this
function, as it will change if this
function is made more sophisticated.
Since it is the text you want, you can use the Linux command pdftotext.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf and saved to output.
This works in the case of your PDF file and extracts the text you want.

This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and even sometimes when there's no non-ASCII characters. When pyPdf encounters strings that are not in standard Unicode encoding, it just sees a bunch of byte code; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time consuming, but I hope to send a patch to the maintainer some time around mid-2011.

You could also try the pdfminer library (also in python), and see if it's better at extracting the text. For splitting however, you will have to stick with pyPdf as pdfminer doesn't support that.

I find it sometimes useful to convert it to ps (try with pdf2psand pdftops for potential differences) then back to pdf (ps2pdf). Then try your original script again.

I had similar problem with some pdfs and for windows, this is working excellent for me:
1.- Download Xpdf tools for windows
2.- copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64
3.- use subprocess to run command from console:
import subprocess
try:
extInfo = subprocess.check_output('pdftotext.exe '+filePath + ' -',shell=True,stderr=subprocess.STDOUT).strip()
except Exception as e:
print (e)

I'm starting to think I should adopt a messy two-part solution. there are two sections to the PDF, pp 1-82 which have text page labels (pdftotext can extract), and pp 83-end which have no page labels but pyPDF can extract and it explicitly knows pages.
I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Encoding problems using popen - python

Related

Wrong character encoding of rtf file

Trying to use json load with txt file

Proper format for a file name in python

Creating pdfs in Python with Pisa / xhtml2pdf

pyPdf unable to extract text from some pages in my PDF

Categories

Resources