UTF-8 decoding with Python

I have a CSV file with some data, and one row contains text that was added after being encoded in UTF-8.
This is the text:
"b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'"
I'm trying to recover the original characters from this text using the decode function, but it isn't working.
Does anyone know the correct procedure to do it?

Assuming that the line in your file is exactly like this:
b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
And reading the line from the file gives the output:
>>> line
"b'\\xe7\\x94\\xb3\\xe8\\xbf\\xaa\\xe8\\xa5\\xbf\\xe8\\xb7\\xaf255\\xe5\\xbc\\x84660\\xe5\\x8f\\xb7\\xe5\\x92\\x8c665\\xe5\\x8f\\xb7 \\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe4\\xb8\\x8a\\xe6\\xb5\\xb7\\xe6\\xb5\\xa6\\xe4\\xb8\\x9c\\xe6\\x96\\xb0\\xe5\\x8c\\xba 201205'"`
You can use the eval() function:
with open(r"your_csv.csv", "r") as csvfile:
    for line in csvfile:
        # when you reach the desired line
        b = eval(line).decode('utf-8')
Output:
>>> print(b)
申迪西路255弄660号和665号 中国上海浦东新区 201205
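Note that eval() executes arbitrary code, which is risky if the file contents are untrusted; ast.literal_eval() accepts only Python literals and is a safer drop-in here (a minimal sketch):
import ast

line = r"b'\xe7\x94\xb3\xe8\xbf\xaa'"       # a bytes literal read in as text
b = ast.literal_eval(line).decode('utf-8')  # parse the literal, then decode
print(b)  # 申迪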

Try this:
a = b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
print(a.decode('utf-8')) #your decoded output
Since you say you are reading from a file, you can try passing the encoding when opening it (Python 2 syntax):
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
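For Python 3, the built-in open() accepts the encoding directly, so codecs.open() isn't needed (a sketch):
with open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print(repr(line))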

Related

Problem reading special characters from a file in Python

I am having a problem reading a file in Python. The file contains some Unicode characters, like below.
Test_data.txt :
ý[þ»¢5åÆ¢Nde¼Èó!`Å6^
But when I try to read the file, some extra characters appear in the text, like below.
ý[þ»¢5\x1få\x8fÆ\x0f¢Nde¼Èó!\x0c`Å6\x1d\x1a^
My code is below:
main_data_full = []
main_file = open("Test_data.txt", "r", encoding='utf-8')
main_data = []
for line in main_file:
    main_data_full.extend(line.split("\n"))
print(main_data_full)
I don't want the extra "\x" characters appearing in the text. Can anyone help me fix the code?
An opened Python file is iterable, line by line, so you don't need to split lines yourself or use extend().
For example, suppose we have this file:
some data
ý[þ»¢5åÆ¢Nde¼Èó!`Å6^
blah
blah2
A small program:
import sys

with open(sys.argv[1], 'r', encoding='utf-8') as fh:
    # One way to read the lines.
    lines = []
    for line in fh:
        lines.append(line)

    # Another:
    # lines = list(fh)

    # And another:
    # lines = fh.readlines()

print(lines)
Output:
['some data\n', 'ý[þ»¢5åÆ¢Nde¼Èó!`Å6^\n', 'blah\n', 'blah2\n']
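If the trailing newlines are unwanted, strip them as you collect the lines (a small sketch under the same assumptions):
import sys

with open(sys.argv[1], 'r', encoding='utf-8') as fh:
    lines = [line.rstrip('\n') for line in fh]  # drop only the trailing newline
print(lines)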

Replace all Unicode codes with characters in Python

I have a text file which looks like this:
l\u00f6yt\u00e4\u00e4
but all the Unicode escape codes need to be replaced with the corresponding characters, so it should look like this:
löytää
The problem is that I do not want to replace every code by hand; what is the most efficient way to do this automatically?
My code looks like this right now, but it certainly needs to be refined! (The code is in Python 3.)
import io

input = io.open("input.json", "r", encoding="utf-8")
output = io.open("output.txt", "w", encoding="utf-8")
with input, output:
    # Read input file.
    file = input.read()
    file = file.replace("\\u00e4", "ä")
    # I think the last line is the same as the line below:
    # file = file.replace("\\u00e4", u"\u00e4")
    file = file.replace("\\u00c4", "Ä")
    file = file.replace("\\u00f6", "ö")
    file = file.replace("\\u00d6", "Ö")
    # ...
    # I cannot put all the codes in Unicode here manually!
    # ...
    # Write the output file.
    output.write(file)
Just decode the JSON as JSON, and then write out a new JSON document without ensuring the data is ASCII safe:
import json

with open("input.json", "r", encoding="utf-8") as input:
    with open("output.txt", "w", encoding="utf-8") as output:
        document = json.load(input)
        json.dump(document, output, ensure_ascii=False)
From the json.dump() documentation:
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Demo:
>>> import json
>>> print(json.loads(r'"l\u00f6yt\u00e4\u00e4"'))
löytää
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"')))
"l\u00f6yt\u00e4\u00e4"
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"'), ensure_ascii=False))
"löytää"
If you have extremely large documents, you could still process them textually, line by line, but use regular expressions to do the replacements:
import re

unicode_escape = re.compile(
    r'(?<!\\)'
    r'(?:\\u([dD][89abAB][a-fA-F0-9]{2})\\u([dD][c-fC-F][a-fA-F0-9]{2})'
    r'|\\u([a-fA-F0-9]{4}))')

def replace(m):
    return bytes.fromhex(''.join(m.groups(''))).decode('utf-16-be')

with open("input.json", "r", encoding="utf-8") as input:
    with open("output.txt", "w", encoding="utf-8") as output:
        for line in input:
            output.write(unicode_escape.sub(replace, line))
This however fails if your JSON has embedded JSON documents in strings or if the escape sequence is preceded by an escaped backslash.
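For example, applied in-memory to the escaped string from the question:
>>> unicode_escape.sub(replace, r'l\u00f6yt\u00e4\u00e4')
'löytää'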

Replace one character with another in a file

I'd like to modify some characters of a file in place, without copying the entire content into another file or overwriting the existing one. However, it doesn't seem possible to just replace one character with another:
>>> f = open("foo", "a+") # file does not exist
>>> f.write("a")
1
>>> f.seek(0)
0
>>> f.write("b")
1
>>> f.seek(0)
0
>>> f.read()
'ab'
Here I'd have expected "a" to be replaced by "b", so that the content of the file would be just "b", but this is not the case. Is there a way to do this?
That's because of the mode you're using: in append mode, the file pointer is moved to the end of the file before every write. You should open your file in w+ mode:
f = open("foo", "w+") # file does not exist
f.write("samething")
f.seek(1)
f.write("o")
f.seek(0)
print f.read() # prints "something"
If you want to do that on an existing file without truncating it, you should open it in r+ mode for reading and writing.
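A minimal sketch of the r+ approach on an existing file (assuming "foo" already contains "something"):
f = open("foo", "r+")  # read/write without truncating
f.seek(1)              # move to the second character
f.write("a")           # overwrite 'o' with 'a' in place
f.seek(0)
print f.read()         # prints "samething"
f.close()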
Truncate the file using file.truncate first:
>>> f = open("foo", "a+")
>>> f.write('a')
>>> f.truncate(0) #truncates the file to 0 bytes
>>> f.write('b')
>>> f.seek(0)
>>> f.read()
'b'
Otherwise, open the file in w+ mode as suggested by @Guillaume.
import fileinput

for line in fileinput.input('abc', inplace=True):
    line = line.replace('t', 'ed')
    print line,
This doesn't replace a single character in place; instead it scans through each line, replaces the required characters, and writes the modified line back.
For example:
file 'abc' contains:
i want
to replace
character
After executing, output would be:
i waned
edo replace
characeder
Hope this helps.
I believe that you may be able to modify the example from this answer.
https://stackoverflow.com/a/290494/1669208
import fileinput

# char1 and char2 are the characters to find and replace.
for line in fileinput.input("test.txt", inplace=True):
    print line.replace(char1, char2),
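For Python 3, the same approach with the print() function would look like this (a sketch with concrete characters, since char1 and char2 above are placeholders):
import fileinput

for line in fileinput.input("test.txt", inplace=True):
    print(line.replace('t', 'ed'), end='')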

open() and codecs.open() in Python 2.7 behave strangely differently

I have a text file whose first line contains Unicode characters and all other lines are ASCII.
I am trying to read the first line into one variable and all other lines into another. However, when I use the following code:
# -*- coding: utf-8 -*-
import codecs
import os
filename = '1.txt'
f = codecs.open(filename, 'r3', encoding='utf-8')
print f
names_f = f.readline().split(' ')
data_f = f.readlines()
print len(names_f)
print len(data_f)
f.close()
print 'And now for something completely differerent:'
g = open(filename, 'r')
names_g = g.readline().split(' ')
print g
data_g = g.readlines()
print len(names_g)
print len(data_g)
g.close()
I get the following output:
<open file '1.txt', mode 'rb' at 0x01235230>
28
7
And now for something completely differerent:
<open file '1.txt', mode 'r' at 0x017875A0>
28
77
If I don't call readline() first, readlines() reads the whole file, not only the first 7 lines, with both codecs.open() and open().
Why does this happen?
And why does codecs.open() read the file in binary mode, even though the 'r' parameter is given?
Update: This is the original file: http://www1.datafilehost.com/d/0792d687
Because you used .readline() first, the codecs.open() file has filled a linebuffer; the subsequent call to .readlines() returns only the buffered lines.
If you call .readlines() again, the rest of the lines are returned:
>>> f = codecs.open(filename, 'r3', encoding='utf-8')
>>> line = f.readline()
>>> len(f.readlines())
7
>>> len(f.readlines())
71
The work-around is to not mix .readline() and .readlines():
f = codecs.open(filename, 'r3', encoding='utf-8')
data_f = f.readlines()
names_f = data_f.pop(0).split(' ') # take the first line.
This behaviour is really a bug; the Python devs are aware of it, see issue 8260.
The other option is to use io.open() instead of codecs.open(); the io library is what Python 3 uses to implement the built-in open() function and is a lot more robust and versatile than the codecs module.
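A sketch of the same reading logic with io.open(), which does not suffer from the buffering bug:
import io

f = io.open('1.txt', 'r', encoding='utf-8')
names_f = f.readline().split(' ')  # first line only
data_f = f.readlines()             # all remaining lines
f.close()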

How to parse a single-line CSV string without the csv.reader iterator in Python?

I have a CSV file that I need to rearrange and re-encode. I'd like to run
line = line.decode('windows-1250').encode('utf-8')
on each line before it's parsed and split by the CSV reader. Alternatively, I'd like to iterate over the lines myself, run the re-encoding, and use single-line parsing from the csv library, but with the same reader instance.
Is there a way to do that nicely?
Looping over the lines of a file can be done this way:
with open('path/to/my/file.csv', 'r') as f:
    for line in f:
        print line  # here you can convert the encoding and save the lines
But if you want to convert the encoding of the whole file, you can also call:
$ iconv -f WINDOWS-1250 -t UTF-8 < file.csv > file_utf8.csv
(Redirect to a new file: writing back to file.csv would truncate the input before iconv reads it.)
Edit: So where is the problem?
with open('path/to/my/file.csv', 'r') as f:
    for line in f:
        line = line.decode('windows-1250').encode('utf-8')
        elements = line.split(",")
Thanks for the answers. The wrapping one gave me an idea:
import csv

def reencode(file):
    for line in file:
        yield line.decode('windows-1250').encode('utf-8')

csv_writer = csv.writer(open(outfilepath, 'w'), delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
csv_reader = csv.reader(reencode(open(filepath)), delimiter=";", quotechar='"')
for c in csv_reader:
    l = c  # rearrange the columns here
    csv_writer.writerow(l)
That's exactly what I was going for: re-encoding a line just before it gets parsed by the csv_reader.
At the very bottom of the csv documentation is a set of classes (UnicodeReader and UnicodeWriter) that implement Unicode support for csv:
rfile = open('input.csv')
wfile = open('output.csv', 'w')
csv_reader = UnicodeReader(rfile, encoding='windows-1250')
csv_writer = UnicodeWriter(wfile, encoding='utf-8')
for c in csv_reader:
    # process Unicode lines
    csv_writer.writerow(c)
rfile.close()
wfile.close()
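For what it's worth, in Python 3 the csv module works on text directly, so the transcoding reduces to opening the files with the right encodings (a sketch; newline='' is what the csv docs recommend for csv files):
import csv

with open('input.csv', encoding='windows-1250', newline='') as rfile:
    with open('output.csv', 'w', encoding='utf-8', newline='') as wfile:
        csv_reader = csv.reader(rfile, delimiter=';', quotechar='"')
        csv_writer = csv.writer(wfile, delimiter=',', quotechar='"')
        for row in csv_reader:
            # rearrange the columns here if needed
            csv_writer.writerow(row)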
