Unicode-strings from xlrd - python

I'm trying to read some information from an Excel file using the xlrd module. This works fine most of the time, but whenever the script encounters Scandinavian letters, it stops. I've read several posts about Unicode and encoding, but I must admit I'm not familiar with the subject.
The cell I'm reading contains text (a string) and is read as unicode (as is normal with xlrd). One example of a value that fails is Glørmestervej, which xlrd reads as u'Gl\xf8rmestervej'. If I try to print the variable, the script stops. I've had the most success by encoding the value with latin1:
print cellValue.encode("latin1")
which gives the result Glormestervej, but with a KeyError.
How do I get the variable to become a string with ø instead of \xf8? The reason is that I need to use it as input to another service, and that service does not seem to work with unicode.
Regards, Torbjørn
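For reference, the usual fix is to keep the value as unicode inside the script and encode it to bytes only at the boundary. A minimal Python 2 sketch; which encoding the receiving service expects is an assumption, since the question doesn't say:
# -*- coding: utf-8 -*-
value = u'Gl\xf8rmestervej'          # what xlrd hands back for 'Glørmestervej'
as_utf8 = value.encode('utf-8')      # -> 'Gl\xc3\xb8rmestervej'
as_latin1 = value.encode('latin-1')  # -> 'Gl\xf8rmestervej'
print as_utf8                        # a UTF-8 terminal displays: Glørmestervej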

I'm happy to say the problem has been solved; in fact, there was no error after all. There were some permission issues with the user account I used for calling the service in which the variable was used. Thank you for your responses!

Related

Fontforge python implementation won't accept unicode value in glyphpen definition

I'm trying to convert a CAD font file to ttf for use with HTML using Python and Fontforge.
The program reads the fontfile data:
data = f.read(4)                      # f is the CAD font file, opened in binary mode ('rb')
glyph['offset'] = f.tell()
glyph['glyphname'] = data[1]*256 + data[0]                        # bytes 0-1, little-endian
glyph['pathsize'] = ((data[3] << 8) & 0xff00) + (data[2] & 0xff)  # bytes 2-3, little-endian
(Forgive the weird manipulation of the data bytes: I have been trying various ways of inputting the data in case there's something I'm doing wrong).
I then define the glyph by creating my character
uniname=glyph['glyphname']
char=font.createChar(uniname)
pen=font[uniname].glyphPen()
This works fine until I get to the unicode character 260, when pdb tells me that there is a TypeError: Index out of bounds.
The funny thing is that, if I run the following instead:
for i in range(253, 280):
    uniname = i
    print(uniname)
    char = font.createChar(uniname)
    pen = font[uniname].glyphPen()
Then it happily accepts all the values without complaint.
I'm baffled.
I finally got this to work.
Instead of doing this:
char=font.createChar(uniname)
pen=font[uniname].glyphPen()
I did this:
char=font.createChar(uniname)
pen=char.glyphPen()
In the first example, the glyph is created for the uniname character using font.createChar() and the pen is then fetched by indexing the font's list of characters.
In the second example, the glyph is created as before, but the pen is taken directly from the object that createChar() returned, and I no longer get the 'index out of bounds' error.
I have no idea why this works, but I hope it will help someone else with similar issues.
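A plausible explanation (an editorial assumption, not confirmed in the answer): indexing a font with an integer looks the glyph up by encoding slot, which may not extend as far as the codepoint passed to createChar(), while the glyph object returned by createChar() needs no lookup at all. A minimal sketch along those lines, with a hypothetical file name and header layout:
import fontforge

font = fontforge.font()
with open('cadfont.dat', 'rb') as f:        # 'cadfont.dat' is a made-up name
    data = f.read(4)
    codepoint = data[1]*256 + data[0]       # bytes 0-1, little-endian uint16
    char = font.createChar(codepoint)
    pen = char.glyphPen()                   # take the pen from the returned glyph,
                                            # not from font[codepoint]
    # ... draw the outline with pen.moveTo()/pen.lineTo()/pen.closePath() ...
    pen = None                              # per the FontForge docs, release the pen when done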

Can we remove the input function's line length limit purely within Python? [duplicate]

I'm trying to input() a string containing a large paste of JSON.
(Why I'm pasting a large blob of json is outside the scope of my question, but please believe me when I say I have a not-completely-idiotic reason!)
However, input() only grabs the first 4095 characters of the paste, for reasons described in this answer.
My code looks roughly like this:
import json
foo = input()
json.loads(foo)
When I paste a blob of JSON that's longer than 4095 characters, json.loads(foo) raises an error. (The error varies based on the specifics of how the JSON gets cut off, but it invariably fails one way or another because it's missing the final }.)
I looked at the documentation for input(), and it made no mention of anything that looked useful for this issue. No flags to input in non-canonical mode, no alternate input()-style functions to handle larger inputs, etc.
Is there a way to be able to paste large inputs successfully? This would make my tool's workflow way less janky than having to paste into a file, save it somewhere, and then pass the file's location into the script.
Python has to follow the terminal's rules. But you could invoke a system command from Python to change the terminal behaviour and change it back afterwards (Linux):
import subprocess, json

subprocess.check_call(["stty", "-icanon"])      # leave canonical mode: lifts the per-line limit
try:
    result = json.loads(input())
finally:
    subprocess.check_call(["stty", "icanon"])   # always restore canonical mode
Alternatively, consider asking your provider for an indented JSON dump that you can read line by line until EOF and then decode:
data = "".join(sys.stdin.readlines())
result = json.loads(data)

python: same character, different behavior

I'm generating file names from a list pulled out of a Postgres DB with Python 2.7.9. The list contains words with special characters. Normally I use ''.join() to build the name and hand it to my loader, but there is one name that won't be recognized. The .py file is set for utf-8 encoding, but the words are in Portuguese, so I think they are latin-1 encoded.
from pydub import AudioSegment
from pydub.playback import play

templist = ['+ Orégano', '- Búfala', '+ Rúcola']
count_ins = len(templist) - 1
while count_ins >= 0:
    kot_istructions = AudioSegment.from_ogg('/home/effe/voice_orders/Voz/' + "".join(templist[count_ins]) + '.ogg')
    count_ins -= 1
    play(kot_istructions)
The first two files are loaded:
/home/effe/voice_orders/Voz/+ Orégano.ogg
/home/effe/voice_orders/Voz/- Búfala.ogg
The third should be:
/home/effe/voice_orders/Voz/+ Rúcola.ogg
But python is trying to load
/home/effe/voice_orders/Voz/+ R\xc3\xbacola.ogg
Why just this one? I've tried to use normalize() to remove the accent, but since the value is a byte string the method didn't work.
print works well, as does the DB update; only the file name creation doesn't work as expected.
Suggestions?
It seems the root cause might be that the encoding of these names is inconsistent within your database.
If you run:
>>> 'R\xc3\xbacola'.decode('utf-8')
You get
u'R\xfacola'
which is in fact a Python unicode object, correctly representing the name. So, what should you do? Although it's a really unclean programming style, you could play .encode()/.decode() whack-a-mole: try to decode the raw string from your DB as utf-8 and, failing that, as latin-1. It would look something like this:
try:
    clean_unicode = dirty_string.decode('utf-8')
except UnicodeDecodeError:
    clean_unicode = dirty_string.decode('latin-1')
As a general rule, always work with clean unicode objects within your own code, and only convert to a specific encoding when saving data out. Also, don't let people insert data into a database without specifying the encoding; that will stop you from having this problem in the first place.
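A minimal Python 2 sketch of that decode-early/encode-late rule (the byte string and the output file name are made up for illustration):
import codecs

raw = '+ R\xc3\xbacola'                        # bytes as they might arrive from the DB
name = raw.decode('utf-8')                     # decode once, at the boundary
# ... all intermediate work happens on the unicode object ...
out = codecs.open('names.txt', 'w', 'utf-8')   # hypothetical output file
out.write(name + u'\n')                        # encoded back to bytes only on the way out
out.close()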
Hope that helps!
Solved: it was a problem with the file. Deleting it and building it again did the job.

Net-SNMP returns HexString and then just String (Eclipse and Pydev)

I am doing an snmpget using Net-SNMP; specifically, I am sending a command via os.popen("etc"). The value returned is a hex string separated by spaces, something like this: "A0 f0 D0". The returned value sometimes comes in the form "Hex-String: A0 f0 D0.." but sometimes in the form "String:\xA0\xf0\xD0" where, as you can see, the spaces are replaced by "\x" escapes. Does anyone have an idea why this might be happening? I would prefer the returned value to be the hex string with spaces, not the \x form.
I should note that I am using Eclipse with PyDev. I then ran the same code in PyScripter and got back my hex-string value. I ran it again in PyScripter and the \x's returned. Is this something to do with an unclosed pipe?
I should also mention that the data I am getting back is bad in another sense. The Hex-String with spaces returns proper data values, but the String with \xs returns values that are not correct.
I have used Wireshark and it looks like the get request is exactly the same as one sent from the MIB. The MIB request returns the correct data, while the Eclipse request still returns bad data.
PyDev does one thing differently: it calls sys.setdefaultencoding(encoding) with the encoding of the Java console (so that printing unicode to the console won't fail with a complaint that the unicode doesn't decode as ascii). To see whether this is your problem, go to eclipse\plugins\org.python.pydev\PySrc\pydev_sitecustomize\sitecustomize.py and comment out the line that calls sys.setdefaultencoding(encoding).
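A quick diagnostic (an editorial sketch, not part of the original answer) is to run the same two lines in PyDev and in PyScripter and compare:
import sys
print sys.getdefaultencoding()   # a stock CPython 2 prints 'ascii'; PyDev's console may differ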

python2.7 - reading a dictionary from a .txt file riddled with unicode

I enrolled in a Chinese Studies course some time ago, and I thought it would be a great exercise to write a flashcard program in Python. I'm storing the flashcard lists as a dictionary in a .txt file, so far without trouble. The real problems kick in when I try to load the file, encoded in utf-8, into my program. An excerpt of my code:
import codecs

f = codecs.open('list.txt', 'r', 'utf-8')
quiz_list = eval(f.read())
quizy = str(quiz_list).encode('utf-8')
print quizy
Now, if for example list.txt consists of:
{'character1':'男人'}
what is printed is actually
{'character1': '\xe7\x94\xb7\xe4\xba\xba'}
Obviously there are some serious encoding issues here, but I cannot for the life of me understand where they occur. I am working with a terminal that supports utf-8, so not the standard cmd.exe: this is not the problem. Reading a normal list.txt without the curly dict-bits returns the Chinese characters without a problem, so my guess is I'm not handling the dictionary part correctly. Any thoughts would be greatly appreciated!
There's nothing wrong with your encoding... Look at this:
>>> d = {1:'男人'}
>>> d[1]
'\xe7\x94\xb7\xe4\xba\xba'
>>> print d[1]
男人
Printing a unicode string is one thing; printing its representation is another.
str(quiz_list) calls repr() on each value, and repr() produces an ASCII-only representation of the string. If you just print quiz_list['character1'], you'll see the actual characters instead of the escape codes.
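Putting that together, a minimal sketch of the loader that prints the characters themselves (it keeps the question's eval(); ast.literal_eval would be the safer choice for untrusted files):
# -*- coding: utf-8 -*-
import codecs

f = codecs.open('list.txt', 'r', 'utf-8')
quiz_list = eval(f.read())       # ast.literal_eval(f.read()) would be safer
f.close()

# Print the values themselves rather than str(quiz_list): str() substitutes
# repr()'s escape codes for every non-ASCII value.
for key, value in quiz_list.items():
    print key, ':', value        # a utf-8 terminal shows 男人, not \xe7\x94\xb7...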
