Python: raw_input should be converted to unicode - python

I get the user's input with raw_input() and store it in a variable.
How can I convert this variable to Unicode? I need this for further processing.
userinput = raw_input("Hello. What is your Name?")

Just call the decode method on the result of raw_input. The catch is that you need to know the encoding of the terminal where the input was typed:
import sys
value = raw_input("bla bla bla").decode(sys.stdin.encoding or 'utf-8')
But you really should use Python 3, where input() already returns a Unicode string.
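For readers who take the Python 3 route, here is a minimal sketch of the same decoding idea. On Python 3, input() already returns text, so decoding is only needed when raw bytes arrive from somewhere else. The helper name to_text is hypothetical, and the UTF-8 fallback is an assumption carried over from the answer above.

```python
import sys


def to_text(raw, encoding=None):
    """Return raw as a text (Unicode) string.

    to_text is a hypothetical helper name, not part of any library.
    """
    if isinstance(raw, bytes):
        # Assumption: fall back to UTF-8 when the terminal encoding is unknown.
        encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
        return raw.decode(encoding)
    return raw  # already text (the normal case for input() on Python 3)
```

Passing the encoding explicitly, as in to_text(data, "utf-8"), avoids depending on how the terminal is configured.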

Related

Create an input prompt for an integer

Hi, I am a beginner in Python. How do you write an input prompt that takes a single number as input, which represents a new row? Thanks.
In Python, you can use input() to read input from the keyboard. To use the value later, store it in a variable:
str_input = input("Enter the string")
The variable str_input now holds the value for later use, such as printing. To print the value, use:
print(str_input)
If you need to read a number, you must convert the input, because input() always returns a string. To convert it, use:
num_int = int(input("Enter the number"))
This reads the number as an int.
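Note that int() raises ValueError when the reply is not a number. A minimal sketch of a prompt that asks again instead of crashing is given here. The name read_int and its reader parameter are hypothetical; reader only exists so the function can be tried without a keyboard.

```python
def read_int(prompt, reader=input):
    """Prompt until the reply parses as an integer, then return it.

    reader is a hypothetical injection point (it defaults to the built-in
    input) so the function can be exercised without a keyboard.
    """
    while True:
        reply = reader(prompt)
        try:
            return int(reply.strip())
        except ValueError:
            # int() rejects anything that is not a whole number; ask again.
            print("Please enter a whole number.")
```

In a script, row = read_int("Enter the row number: ") would be the typical call.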

Python 3 print utf-8 encoded string problem

I'm requesting a string from a network-service. When I print it from within a program:
variable = getFromNetwork()
print(variable)
and I execute it using python3 net.py I get:
\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612
When I execute in the python3 CLI:
>>> print("\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612")
تÙ
Ù
Ù612
But when I execute it in the python2 CLI, I get the correct result:
>>> print("\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612")
تملي612
How can I print this correctly from my program in Python 3?
Edit
After executing the following line:
print(type(variable), repr(variable))
Got
<class 'str'> '\\xd8\\xaa\\xd9\\x85\\xd9\\x84\\xd9\\x8a612'
I think I should first remove the \\x to turn it into hex and then decode it. What is your solution?
You need to specify the encoding, so the interpreter knows how to interpret the data:
s = "\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612"
y = s.encode('raw_unicode_escape')
print(y)  # y is a bytes object now
print(y.decode('utf-8'))
Out:
b'\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612'
تملي612
Your variable is a (Unicode) string whose code points are the bytes of a UTF-8 encoded string. This can happen when the data was erroneously decoded with the wrong encoding (probably Latin-1 here).
You can fix it by first converting it back to a byte string without changing the code values (that is, by encoding it as Latin-1), after which you can decode it correctly:
variable = getFromNetwork().encode('Latin1').decode()
print(variable)
Demo:
variable = "\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612"
print(variable.encode('Latin1').decode())
تملي612
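The two answers above use different codecs (raw_unicode_escape and Latin-1) for the same step. A small sketch of why they agree for this input: for code points below 256, both map each character to the single byte with the same value. The helper name undo_latin1_misdecode is hypothetical.

```python
def undo_latin1_misdecode(text):
    """Recover UTF-8 text that was mistakenly decoded as Latin-1.

    undo_latin1_misdecode is a hypothetical helper name.
    """
    return text.encode("latin-1").decode("utf-8")


garbled = "\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612"

# For code points below 256, both codecs produce the same bytes.
same_bytes = garbled.encode("latin-1") == garbled.encode("raw_unicode_escape")
fixed = undo_latin1_misdecode(garbled)
```

The Latin-1 variant fails loudly with UnicodeEncodeError if the string contains a character above U+00FF, which can be a useful signal that the input was not garbled in this particular way.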
In Python 3, I tested with the following code:
line = '\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612'
line = line.encode('raw_unicode_escape')
line = line.decode("utf-8")
print(line)
It prints:
تملي612
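The repr() in the question's edit shows doubled backslashes, which suggests the string holds literal backslash, x, and hex-digit characters rather than the characters U+00D8, U+00AA and so on. Under that assumption, one extra step is needed before the Latin-1 round trip. This is a sketch only; decode_escaped_utf8 is a hypothetical name, and the input is assumed to be pure ASCII.

```python
def decode_escaped_utf8(text):
    """Turn text holding literal backslash-x escapes of UTF-8 bytes into readable text.

    decode_escaped_utf8 is a hypothetical helper; it assumes text is pure ASCII.
    """
    # Step 1: interpret the literal escapes; each one becomes a code point below 256.
    unescaped = text.encode("ascii").decode("unicode_escape")
    # Step 2: map those code points back to bytes, then decode the bytes as UTF-8.
    return unescaped.encode("latin-1").decode("utf-8")


# Same content as the repr() shown in the question's edit.
variable = "\\xd8\\xaa\\xd9\\x85\\xd9\\x84\\xd9\\x8a612"
result = decode_escaped_utf8(variable)
```

If the network service can be made to return bytes instead of an escaped string, decoding those bytes directly as UTF-8 would be the simpler fix.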

Why does Python 2's raw_input output unicode strings?

I tried the following on Codecademy's Python lesson
hobbies = []
# Add your code below!
for i in range(3):
    Hobby = str(raw_input("Enter a hobby:"))
    hobbies.append(Hobby)
print hobbies
This works fine, but if I instead try
Hobby = raw_input("Enter a hobby:")
I get [u'Hobby1', u'Hobby2', u'Hobby3']. Where are the extra us coming from?
The question's subject line might be a bit misleading: Python 2's raw_input() normally returns a byte string, NOT a Unicode string.
However, it could return a Unicode string if it or sys.stdin has been altered or replaced (by an application, or as part of an alternative implementation of Python).
Therefore, I believe @ByteCommander is on the right track with his comment:
Maybe this has something to do with the console it's running in?
The Python used by Codecademy is ostensibly 2.7, but (a) it was implemented by compiling the Python interpreter to JavaScript using Emscripten and (b) it's running in the browser; so between those factors, there could very well be some string encoding and decoding injected by Codecademy that isn't present in plain-vanilla CPython.
Note: I have not used Codecademy myself nor do I have any inside knowledge of its inner workings.
The u prefix means it is a Unicode string. You can also call raw_input().encode('utf8') to convert it to a byte string.
Edited:
I checked in Python 2.7: it returns a byte string, not a Unicode string. So the problem is something else here.
Edited:
raw_input() returns unicode if sys.stdin.encoding is unicode.
In the Codecademy Python environment, sys.stdin.encoding and sys.stdout.encoding are both None, and the default encoding is ascii.
Python uses this default encoding only if it cannot determine a proper encoding from the environment.
Where are the extra us coming from?
raw_input() returns Unicode strings in your environment
repr() is called for each item of a list when you print the list (that is, when it is converted to a string)
the text representation (repr()) of a Unicode string is the same as a Unicode literal in Python: u'abc'
that is why print [raw_input()] may produce: [u'abc'].
You don't see u'' in the first code example because str(unicode_string) calls the equivalent of unicode_string.encode(sys.getdefaultencoding()) i.e., it converts Unicode strings to bytestrings—don't do it unless you mean it.
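The same mechanism can be observed on Python 3, where the visible prefix is b for bytes rather than u for Unicode. A short sketch, assuming Python 3 (the variable names are only illustrative):

```python
items = [b"abc", "abc"]

# Converting a list to text uses repr() on each item, so prefixes and quotes appear.
as_list_text = str(items)
# Converting a single text string uses str(), so no quotes or prefixes appear.
as_item_text = str(items[1])
# The list text is exactly the items' repr() values joined inside brackets.
matches_repr = as_list_text == "[" + ", ".join(repr(x) for x in items) + "]"
```

Printing each item in a loop, rather than printing the list itself, is therefore enough to make the prefix disappear.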
Can raw_input() return unicode?
Yes:
#!/usr/bin/env python2
"""Demonstrate that raw_input() can return Unicode."""
import sys
class UnicodeFile:
    def readline(self, n=-1):
        return u'\N{SNOWMAN}'
sys.stdin = UnicodeFile()
s = raw_input()
print type(s)
print s
Output:
<type 'unicode'>
☃
A practical example is the win-unicode-console package, which can replace raw_input() to support entering Unicode characters outside the range of the console codepage on Windows. Related: here's why sys.stdout should be replaced.
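The stdin-replacement demonstration above is written for Python 2. A rough Python 3 counterpart, given as a sketch: the built-in input() also reads from whatever object sys.stdin currently refers to. The helper name input_from is hypothetical.

```python
import io
import sys


def input_from(text):
    """Call the built-in input() with sys.stdin temporarily replaced.

    input_from is a hypothetical helper used only for this demonstration.
    """
    original = sys.stdin
    sys.stdin = io.StringIO(text)
    try:
        return input()  # reads one line from the replacement stream
    finally:
        sys.stdin = original  # always restore the real stdin


line = input_from("\N{SNOWMAN}\n")
```

On Python 3 the result is always str; what the replacement stream changes is where the text comes from, not its type.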
May raw_input() return unicode?
Yes.
raw_input() is documented to return a string:
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that.
A string in Python 2 is either a bytestring or a Unicode string: isinstance(s, basestring).
CPython implementation of raw_input() supports Unicode strings explicitly: builtin_raw_input() can call PyFile_GetLine() and PyFile_GetLine() considers bytestrings and Unicode strings to be strings—it raises TypeError("object.readline() returned non-string") otherwise.
You could encode the strings before appending them to your list:
hobbies = []
# Add your code below!
for i in range(3):
    Hobby = raw_input("Enter a hobby:")
    hobbies.append(Hobby.encode('utf-8'))
print hobbies

Special characters appearing as question marks

Using the Python programming language, I'm having trouble outputting characters such as å, ä and ö. The following code gives me a question mark (?) as output, not an å:
#coding: iso-8859-1
input = "å"
print input
The following code lets you input random text. The for-loop goes through each character of the input, adds them to the string variable a and then outputs the resulting string. This code works correctly; you can input å, ä and ö and the output will still be correct. For example, "år" outputs "år" as expected.
#coding: iso-8859-1
input = raw_input("Test: ")
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a
What's interesting is that if I change input = raw_input("Test: ") to input = "år", it will output a question mark (?) for the "å".
#coding: iso-8859-1
input = "år"
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a
For what it's worth, I'm using TextWrangler, and my document's character encoding is set to ISO Latin 1. What causes this? How can I solve the problem?
You're using Python 2, I assume running on a platform like Linux that encodes I/O in UTF-8.
Python 2's "" literals represent byte-strings. So when you specify "år" in your ISO 8859-1-encoded source file, the variable input has the value b'\xe5r'. When you print this, the raw bytes are output to the console, but show up as a question-mark because they are not valid UTF-8.
To demonstrate, try it with print repr(a) instead of print a.
When you use raw_input(), the user's input is already UTF-8-encoded, so it is output correctly.
To fix this, either:
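Decode the byte string from ISO 8859-1 and re-encode it as UTF-8 before printing it (calling .encode('utf-8') directly on a non-ASCII byte string raises UnicodeDecodeError in Python 2):
print a.decode('iso-8859-1').encode('utf-8')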
Use Unicode strings (u'text') instead of byte-strings. You will need to be careful with decoding the input, since on Python 2, raw_input() returns a byte-string rather than a text string. If you know the input is UTF-8, use raw_input().decode('utf-8').
Encode your source file in UTF-8 instead of iso-8859-1. Then the byte-string literal will already be in UTF-8.
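The byte-level reasoning above can be checked with a short sketch. It is written for Python 3, where bytes and text are separate types, so it illustrates the encodings involved rather than reproducing the Python 2 script. Decoding with errors="replace" is used here as a stand-in for what a UTF-8 console does with an invalid byte.

```python
# What the ISO 8859-1 source file stores for the literal "år".
latin1_bytes = "\u00e5r".encode("iso-8859-1")

# A UTF-8 console cannot interpret the lone 0xE5 byte; decoding with
# errors="replace" mimics the placeholder character that is shown instead.
shown_by_utf8_console = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the real source encoding and re-encoding yields valid UTF-8.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
```

Here latin1_bytes is the two-byte sequence b'\xe5r', while utf8_bytes is three bytes long because å needs two bytes in UTF-8.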

use raw_input generated string safely (Python)

I'm wondering how to use a string from raw_input safely so that I can create a function to replace it for a script that is meant to be used easily and securely.
The reason is that I am trying to make a character sheet generating application using python and need to be able to get a character's full name to pass as a string using a name for easy access (Charname_NLB)
However, as I'm looking to use this for more than that application, I need this to be usable for any string entered as raw input, using this alternate command.
I already have a similar piece made for input of integers and would like to integrate it into the same class, for simplicity's sake. I'll post it here, with thanks to: Mgilson and BlueKitties (from here and www.python-forum.org respectively)
def safeinput(get_num):
    num = float(raw_input(get_num))
    return num
However, if this would not return the same result as the base input command safely, could I please get a working copy? I currently have only one proof of concept to work with, and it wouldn't be accurate with truncated numbers.
Edit: By "any string", I mean specifically that the result will be stored as a string, not used as a command.
Not sure if this is what you are asking for. literal_eval is safe, but it only works for literals. It's very difficult to use eval() safely, because you would have to sanitise the input.
>>> from ast import literal_eval
>>> def safeinput(s):
...     try:
...         return literal_eval(s)
...     except (ValueError, SyntaxError):
...         return s
...
>>> repr(safeinput("1"))
'1' # converted to an int
>>> repr(safeinput("1.1"))
'1.1' # converted to a float
>>> repr(safeinput("'some string in quotes'"))
"'some string in quotes'" # converted to a string
>>> repr(safeinput("some string without quotes"))
"'some string without quotes'" # no conversion necessary

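A Python 3 sketch of the literal_eval approach from the answer above, written as a plain function so it can be tested without a prompt. The name parse_literal is hypothetical. Text that is not a valid literal, including anything that looks like a function call, is returned unchanged and is never executed.

```python
from ast import literal_eval


def parse_literal(text):
    """Return the Python literal that text spells, or text itself if it is not one.

    parse_literal is a hypothetical helper name.
    """
    try:
        return literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a literal (plain words, names, function calls): keep the raw string.
        return text
```

If the value only ever needs to be stored as a string, as the question's edit says, no evaluation is needed at all: the string returned by raw_input() (or input() on Python 3) is inert data until something explicitly evaluates it.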