This question already has answers here:
Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters
(2 answers)
Closed 5 years ago.
I downloaded tweets in Urdu language. When I read the csv file using pandas in python, the tweet is shown as follows:
[Image: sample tweet text]
[Image: Unicode code point]
I want to convert this into utf-8.
When you write the tweet data to a file, encode it with .encode('utf-8').
And when you read the data back from that file, decode it with .decode('utf-8').
Here I am posting an example:
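# -*- coding: utf-8 -*-
# Python 2: a plain string literal is a byte string (UTF-8 bytes here).
string1 = "آکاش کمار"
string2 = string1.decode('utf-8')  # bytes -> unicode text
string3 = string2.encode('utf-8')  # unicode text -> UTF-8 bytes
print(string3)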
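In Python 3 the direction is easier to see, because text (str) and bytes are separate types. Here is a minimal sketch, assuming Python 3; the file name tweets.txt is hypothetical and only used for illustration. Text is encoded to UTF-8 on the way out and decoded on the way back in.

```python
import io
import os
import tempfile

text = "آکاش کمار"  # str: a sequence of Unicode code points

# Encoding turns text into bytes; decoding turns bytes back into text.
raw = text.encode("utf-8")
assert raw.decode("utf-8") == text

# Hypothetical file name, used only for illustration.
path = os.path.join(tempfile.gettempdir(), "tweets.txt")

# With an explicit encoding, io.open encodes on write and decodes on read.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)
```

If the data is loaded with pandas, passing encoding='utf-8' to read_csv plays the same role as the encoding argument above.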
This question already has answers here:
Url decode UTF-8 in Python
(5 answers)
Closed 6 months ago.
So I have the following string
"%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
It actually means this
ボドカさん
This string seems to be encoded in UTF-8 because when I write this in python
encoded_str = b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
print(encoded_str)
print(encoded_str.decode('utf-8'))
Here is the output I get
b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
ボドカさん
But now I would like a script that will allow me to decode any string in the initial format and here is my code.
import re
import os
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
mystr = mystr.lower()
mystr = re.sub('%', r'\\x', mystr)
encoded_str = bytes(mystr, "utf-8")
print(mystr)
print(encoded_str)
print(encoded_str.decode('utf-8'))
Output:
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
b'\\xe3\\x83\\x9c\\xe3\\x83\\x89\\xe3\\x82\\xab\\xe3\\x81\\x95\\xe3\\x82\\x93'
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
I tried many possibilities but couldn't find the right way to properly encode my string the way the b'STRING' literal does. I always get extra \ characters from the encoding step, which then spoil the decoding step too.
I tried every encoding that Python's bytes() function accepts.
I need help please. Thank you.
Stack overflow banned me for that question lol
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
encoded_str = bytes.fromhex(mystr.replace('%', ''))
print(encoded_str.decode('utf-8'))
Output:
ボドカさん
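The bytes.fromhex approach only works when every character in the string is percent-encoded. As an alternative sketch, assuming Python 3, the standard library's urllib.parse.unquote handles percent-decoding directly and also copes with strings that mix plain and encoded characters. The mixed query string below is made up for illustration.

```python
from urllib.parse import unquote, unquote_to_bytes

mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"

# unquote percent-decodes and interprets the bytes as UTF-8 by default.
decoded = unquote(mystr)

# unquote_to_bytes stops one step earlier and returns the raw bytes.
raw = unquote_to_bytes(mystr)

# Made-up example mixing plain ASCII with percent-encoded characters.
mixed = unquote("name=%E3%81%95%E3%82%93&id=42")

print(decoded)
print(raw)
print(mixed)
```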
This question already has an answer here:
How to remove Byte Order Mark in python
(1 answer)
Closed 7 months ago.
I am trying to learn text processing using NLTK, following the NLTK book.
When I read a text, the result looks a little different from what I expect.
import requests
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = requests.get(url)
response.text[:25]
How can I read the text without the highlighted part shown in the uploaded image?
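What you're seeing is a Unicode byte order mark (BOM), '\ufeff', at the start of the text.
One option is to encode the Unicode string to ASCII and ignore anything that is not ASCII. Note that this also drops every other non-ASCII character in the text.
Example:
a=u'\ufeffHello World'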
print(a.encode('ascii', 'ignore'))
"Hello World"
The simple answer is to print the value rather than just evaluating it in the shell:
print(response.text[:25])
Should print:
The Project Gutenberg EB
The shell calls repr on the value to decide what to display, so
print(repr(response.text[:25]))
will again print:
'\ufeffThe Project Gutenberg EB'
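Printing only hides the BOM; the character is still the first element of the string. If the goal is to get rid of it without touching other non-ASCII characters, one option is the 'utf-8-sig' codec. A minimal sketch, assuming Python 3; the raw bytes below are a stand-in for response.content so the example runs without a network request.

```python
import codecs

# Stand-in for response.content: UTF-8 bytes that start with a byte order mark.
raw = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

with_bom = raw.decode("utf-8")         # keeps '\ufeff' as the first character
without_bom = raw.decode("utf-8-sig")  # drops a leading BOM if one is present

# Stripping the character from already-decoded text works as well.
stripped = with_bom.lstrip("\ufeff")

print(repr(with_bom[:4]))
print(repr(without_bom[:3]))
```

With requests, the same idea would be response.content.decode('utf-8-sig'), or setting response.encoding = 'utf-8-sig' before reading response.text.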
This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
Closed 8 years ago.
I want to split a string in python using this code:
means="a ، b ، c"
lst=means.split("،")
but I get this error message:
SyntaxError: Non-ASCII character '\xd8' in file dict.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I declare an encoding?
Put:
# -*- coding: UTF-8 -*-
as the first line of the file (or the second line if the first is a shebang line on *nix) and save the file as UTF-8.
If you're using Python 2, use Unicode string literals (u"..."), for example:
means = u"a ، b ، c"
lst = means.split(u"،")
If you're using Python 3, string literals are Unicode already (unless marked as bytestrings b"...").
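For the Python 3 case, a small sketch of the split from the question. The strip() call is an addition that is not in the original code; it removes the spaces around each item.

```python
# -*- coding: utf-8 -*-
# Python 3: string literals are Unicode, so no u"..." prefix is needed.
means = "a ، b ، c"

# Split on the Arabic comma (U+060C), then trim surrounding whitespace.
lst = [part.strip() for part in means.split("،")]

print(lst)
```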
You need to declare an encoding for your file, as documented here and here.
This question already has answers here:
Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence
(12 answers)
Closed 7 months ago.
Dumping a string that contains unicode characters as json produces weird unicode escape sequences:
text = "⌂⚘いの法嫁"
print(text) # output: ⌂⚘いの法嫁
import json
json_text = json.dumps(text)
print(json_text) # output: "\u2302\u2698\u3044\u306e\u6cd5\u5ac1"
I'd like to get this output instead:
"⌂⚘いの法嫁"
How can I dump unicode characters as characters instead of escape sequences?
Call json.dumps with ensure_ascii=False:
json_string = json.dumps(json_dict, ensure_ascii=False)
On Python 2, the return value will be unicode instead of str, so you might want to encode it before doing anything else with it.
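A short sketch comparing both modes and writing the result to disk, assuming Python 3. The dictionary contents and the file name example.json are made up for illustration. The file is opened with an explicit UTF-8 encoding so the unescaped characters are stored correctly.

```python
import io
import json
import os
import tempfile

data = {"text": "⌂⚘いの法嫁"}  # illustrative payload

escaped = json.dumps(data)                       # ASCII-only, uses \uXXXX escapes
readable = json.dumps(data, ensure_ascii=False)  # keeps the characters as-is

# Hypothetical file name; the encoding must be able to represent the characters.
path = os.path.join(tempfile.gettempdir(), "example.json")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(readable)

with io.open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(escaped)
print(readable)
```

Both strings are valid JSON and load back to the same object; the difference is only in how the text looks on disk.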
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Character reading from file in Python
I want to strip all special characters from an input string read from a file, keeping actual letters (even Cyrillic letters shouldn't be stripped). The solution I found manually declares the string as unicode and compiles the pattern with the re.UNICODE flag, so actual letters from different languages are detected.
# -*- coding: utf-8 -*-
import re
pattern = re.compile("[^\w\d]",re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'
uni = u'ähm whatßs äüöp ×äØü'
words = pattern.split(n_uni) #doesn't work
u_words = pattern.split(uni) #works
So if I write the string directly in the source and manually define it as Unicode it gives me the desired output while the non-Unicode string gives me just garbage:
"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters
My question is now how do I define the input from a file as Unicode?
My question is now how do I define the input from a file as unicode?
Straight from the docs.
import codecs
# codecs.open decodes each line from UTF-8 into a unicode string as it is read.
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print(repr(line))
f.close()
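To tie this back to the regex from the question, here is a sketch that reads a UTF-8 file and splits each line on non-word characters. It assumes Python 3 (io.open with an encoding also exists on Python 2.6+); the file name unicode_sample.txt is hypothetical, and the file is written first only so the example is self-contained.

```python
import io
import os
import re
import tempfile

# Hypothetical sample file, created here so the example runs on its own.
path = os.path.join(tempfile.gettempdir(), "unicode_sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write("ähm whatßs äüöp ×äØü\n")

# \w already covers digits, so \d is not needed inside the character class.
pattern = re.compile(r"[^\w]+", re.UNICODE)

# Lines come back as Unicode text because the encoding is given on open.
with io.open(path, "r", encoding="utf-8") as f:
    for line in f:
        words = [w for w in pattern.split(line) if w]  # drop empty pieces
        print(words)
```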