Unicode code points to UTF-8 in Python [duplicate]

This question already has answers here:
Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters
(2 answers)
Closed 5 years ago.
I downloaded tweets in the Urdu language. When I read the CSV file using pandas in Python, the tweet is shown as follows:
Sample tweet text
Unicode code point
I want to convert this into utf-8.

When you write the tweet data to a file, encode the Unicode text with .encode('utf-8').
When you read data back from that file, decode the bytes with .decode('utf-8').
Here is an example (Python 2, where a plain string literal is a byte string):
# -*- coding: utf-8 -*-
string1 = "آکاش کمار"
string2 = string1.decode('utf-8')
string3 = string2.encode('utf-8')
print(string3)
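In Python 3 the same round trip reads a little differently, because a plain string literal is already Unicode text. The following is a minimal sketch rather than the answerer's code; the variable names are illustrative, and the last part assumes the CSV holds literal backslash escapes such as \u0622, which the question does not confirm.

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
text = "آکاش کمار"                  # Unicode text (str)

encoded = text.encode("utf-8")      # str -> bytes, e.g. before writing raw bytes
decoded = encoded.decode("utf-8")   # bytes -> str, e.g. after reading raw bytes

print(type(encoded).__name__)       # bytes
print(decoded == text)              # True

# Assumption: the data contains literal escapes like "\\u0622" and is
# otherwise pure ASCII. In that case the escapes can be turned into
# real characters with the unicode_escape codec.
escaped = "\\u0622\\u06a9\\u0627\\u0634"
unescaped = escaped.encode("ascii").decode("unicode_escape")
print(unescaped)
```

When reading or writing files directly, passing encoding="utf-8" to open() does the encode and decode steps for you.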

Related

How to decode this "%E3%83%9C" string in python? [duplicate]

This question already has answers here:
Url decode UTF-8 in Python
(5 answers)
Closed 6 months ago.
So I have the following string
"%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
It actually means this
ボドカさん
This string seems to be encoded in UTF-8 because when I write this in python
encoded_str = b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
print(encoded_str)
print(encoded_str.decode('utf-8'))
Here is the output I get
b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
ボドカさん
But now I would like a script that will allow me to decode any string in the initial format and here is my code.
import re
import os
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
mystr = mystr.lower()
mystr = re.sub('%', r'\\x', mystr)
encoded_str = bytes(mystr, "utf-8")
print(mystr)
print(encoded_str)
print(encoded_str.decode('utf-8'))
Output:
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
b'\\xe3\\x83\\x9c\\xe3\\x83\\x89\\xe3\\x82\\xab\\xe3\\x81\\x95\\xe3\\x82\\x93'
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
I tried many possibilities but couldn't find the right way to encode my string properly, the way the b'STRING' literal does. I always get extra \ characters from the encoding process, which then spoil the decoding too.
I tried every encoding that Python accepts for the bytes() function.
I need help please. Thank you.
Stack overflow banned me for that question lol
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
encoded_str = bytes.fromhex(mystr.replace('%', ''))
print(encoded_str.decode('utf-8'))
Output:
ボドカさん
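The bytes.fromhex approach above works when every character in the input is percent-encoded. As a hedged alternative, the standard library's urllib.parse.unquote decodes percent-escapes as UTF-8 by default and also copes with strings that mix plain characters and escapes. The mixed string below is a made-up input for illustration.

```python
from urllib.parse import unquote, unquote_to_bytes

mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = unquote(mystr)
print(decoded)

# Hypothetical input mixing plain characters with escapes;
# bytes.fromhex() would fail on this, unquote() does not.
mixed = "name=%E3%83%9C&x=1"
decoded_mixed = unquote(mixed)
print(decoded_mixed)

# unquote_to_bytes() returns the raw bytes if those are needed instead.
raw = unquote_to_bytes(mystr)
print(raw)
```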

Python Requests Reading text [duplicate]

This question already has an answer here:
How to remove Byte Order Mark in python
(1 answer)
Closed 7 months ago.
I am trying to learn text processing using NLTK, following the NLTK book.
When I try to read a text, the result looks slightly different from what I expect.
import requests
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = requests.get(url)
response.text[:25]
How can I read the text without the highlighted part in the uploaded image (the leading '\ufeff')?
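What you are seeing is a Unicode character: U+FEFF, the byte order mark (BOM), at the start of the string.
One option is to encode the Unicode string to ASCII with the 'ignore' error handler, which drops the BOM. Note that it also drops every other non-ASCII character in the text.
Example (Python 2):
a = u'\ufeffHello World'
print(a.encode('ascii', 'ignore'))
Output:
Hello World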
The simple answer is to print the value rather than just evaluating it in the shell:
print(response.text[:25])
This should print:
The Project Gutenberg EB
The BOM is still in the string, but it is invisible when printed. The shell calls repr() on the value to decide what to display, so
print(repr(response.text[:25]))
will again print:
'\ufeffThe Project Gutenberg EB'
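To actually remove the BOM rather than just hide it, one common approach is to decode the raw bytes with the utf-8-sig codec, which strips a leading BOM if one is present. The sketch below avoids the network: raw is a stand-in for response.content, built by hand for illustration.

```python
import codecs

# Stand-in for response.content: UTF-8 bytes that begin with a BOM.
raw = codecs.BOM_UTF8 + "The Project Gutenberg".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped if present

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

With requests, setting response.encoding = "utf-8-sig" before reading response.text should have the same effect, assuming the server really sends UTF-8.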

Declaring encoding in Python [duplicate]

This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
Closed 8 years ago.
I want to split a string in python using this code:
means="a ، b ، c"
lst=means.split("،")
but I get this error message:
SyntaxError: Non-ASCII character '\xd8' in file dict.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I declare an encoding?
Put:
# -*- coding: UTF-8 -*-
as the first line of the file (or the second line if the first is a #! shebang line, as is common on *nix) and save the file as UTF-8.
If you're using Python 2, use Unicode string literals (u"..."), for example:
means = u"a ، b ، c"
lst = means.split(u"،")
If you're using Python 3, string literals are Unicode already (unless marked as bytestrings b"...").
You need to declare an encoding for your file, as documented in PEP 263 (the page linked in the error message).
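For comparison, here is a small Python 3 sketch of the same split. Python 3 treats source files as UTF-8 by default, so no declaration is needed as long as the file is really saved as UTF-8. The strip() step is an addition for illustration and is not part of the original question.

```python
# Python 3: string literals are Unicode and the source encoding defaults to UTF-8.
means = "a ، b ، c"

# Split on the Arabic comma, then trim the surrounding spaces.
parts = [part.strip() for part in means.split("،")]
print(parts)
```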

Python JSON loads/dumps breaks Unicode? [duplicate]

This question already has answers here:
Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence
(12 answers)
Closed 7 months ago.
Dumping a string that contains unicode characters as json produces weird unicode escape sequences:
text = "⌂⚘いの法嫁"
print(text) # output: ⌂⚘いの法嫁
import json
json_text = json.dumps(text)
print(json_text) # output: "\u2302\u2698\u3044\u306e\u6cd5\u5ac1"
I'd like to get this output instead:
"⌂⚘いの法嫁"
How can I dump unicode characters as characters instead of escape sequences?
Call json.dumps with ensure_ascii=False:
json_string = json.dumps(json_dict, ensure_ascii=False)
On Python 2, the return value will be unicode instead of str, so you might want to encode it before doing anything else with it.
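A short Python 3 sketch of the difference, reusing the string from the question. Both forms are valid JSON and decode back to the same text; only the serialized representation differs.

```python
import json

text = "⌂⚘いの法嫁"

escaped = json.dumps(text)                       # ASCII-only, uses \uXXXX escapes
readable = json.dumps(text, ensure_ascii=False)  # keeps the characters as they are

print(escaped)
print(readable)

# Both forms decode back to the same string.
same = json.loads(escaped) == json.loads(readable) == text
print(same)
```

When writing the readable form to a file, open the file with encoding="utf-8" so the non-ASCII characters are stored as UTF-8 regardless of the platform default.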

How do I define strings read from a file as Unicode? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Character reading from file in Python
I want to strip an input string read from a file of all special characters, keeping only actual letters (even Cyrillic letters shouldn't be stripped). The solution I found manually declares the string as Unicode and compiles the pattern with the re.UNICODE flag, so that letters from different languages are detected.
# -*- coding: utf-8 -*-
import re
pattern = re.compile(r"[^\w\d]", re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'
uni = u'ähm whatßs äüöp ×äØü'
words = pattern.split(n_uni) #doesn't work
u_words = pattern.split(uni) #works
So if I write the string directly in the source and manually define it as Unicode it gives me the desired output while the non-Unicode string gives me just garbage:
"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters
My question is now how do I define the input from a file as Unicode?
Straight from the docs.
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
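The snippet above is Python 2. In Python 3 the built-in open() accepts an encoding argument and returns str (Unicode) lines, and \w in a str pattern already matches Unicode letters. The sketch below is an illustration under those assumptions; it writes a throwaway file so it can run on its own, and the sample text is taken from the question.

```python
import os
import re
import tempfile

sample = "ähm whatßs äüöp ×äØü"

# Write a throwaway UTF-8 file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# open() with an explicit encoding yields str (Unicode) in Python 3.
with open(path, encoding="utf-8") as f:
    line = f.read()
os.remove(path)

# For str patterns, \W matches anything that is not a Unicode letter,
# digit, or underscore; empty pieces are filtered out.
words = [w for w in re.split(r"\W+", line) if w]
print(words)
```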
