Unable to Save Arabic Decoded Unicode to CSV File Using Python

I am working with a Twitter streaming package for Python. I am currently using a keyword, written in Unicode, to search for tweets containing that word. I am then using Python to create a CSV database file of the tweets. However, I want to convert the tweets back to Arabic symbols when I save them in the CSV.
The errors I am receiving are all similar to "error on_data: the ASCII characters in position ___ are not within the range of 128."
Here is my code:
class listener(StreamListener):
    def on_data(self, data):
        try:
            # print data
            tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
            now = datetime.now()
            tweetsymbols = tweet.encode('utf-8')
            print tweetsymbols
            saveThis = str(now) + ':::' + tweetsymbols.decode('utf-8')
            saveFile = open('rawtwitterdata.csv', 'a')
            saveFile.write(saveThis)
            saveFile.write('\n')
            saveFile.close()
            return True

Excel requires a Unicode BOM character written to the beginning of a UTF-8 file to view it properly. Without it, Excel assumes "ANSI" encoding, which is OS locale-dependent.
This writes a 3-row, 3-column CSV file with Arabic:
#!python2
#coding:utf8
import io

with io.open('arabic.csv', 'w', encoding='utf-8-sig') as f:
    s = u'إعلان يونيو وبالرغم تم. المتحدة'
    s = u','.join([s, s, s]) + u'\n'
    f.write(s)
    f.write(s)
    f.write(s)
For your specific example, just make sure to write a BOM character u'\ufeff' as the first character of your file, encoded in UTF-8. In the example above, the 'utf-8-sig' codec ensures a BOM is written.
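If you'd rather write the BOM yourself instead of relying on 'utf-8-sig', a minimal sketch (the file name and sample row below are only placeholders) could look like this:
#!python2
# -*- coding: utf-8 -*-
import io

# Sketch: write the BOM explicitly, then append rows as UTF-8 text.
with io.open('arabic_manual_bom.csv', 'w', encoding='utf-8') as f:
    f.write(u'\ufeff')  # BOM so Excel detects UTF-8 instead of assuming "ANSI"
    f.write(u'2016-01-01 12:00:00:::إعلان يونيو وبالرغم تم\n')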
Also consult this answer, which shows how to wrap the csv module to support Unicode, or get the third party unicodecsv module.
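As a rough sketch of the unicodecsv route for the tweet data (assuming pip install unicodecsv; the timestamp and text values here are made up for illustration):
# -*- coding: utf-8 -*-
import unicodecsv

# Sketch only: unicodecsv encodes unicode values to UTF-8 for you when writing.
with open('rawtwitterdata.csv', 'ab') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')
    writer.writerow([u'2016-01-01 12:00:00', u'إعلان يونيو وبالرغم تم'])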

Here is a snippet that writes Arabic text:
# coding=utf-8
import codecs
from datetime import datetime

class listener(object):
    def on_data(self, tweetsymbols):
        # python2
        # tweetsymbols is str
        # tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
        now = datetime.now()
        # work with unicode
        saveThis = unicode(now) + ':::' + tweetsymbols.decode('utf-8')
        # open before the try, so a failed open cannot leave saveFile unbound in finally
        saveFile = codecs.open('rawtwitterdata.csv', 'a', encoding="utf8")
        try:
            saveFile.write(saveThis)
            saveFile.write('\n')
        finally:
            saveFile.close()
        return self

listener().on_data("إعلان يونيو وبالرغم تم. المتحدة")
Everything you need to know about encoding: https://pythonhosted.org/kitchen/unicode-frustrations.html

Related

How to read binary file to text in python 3.9

I have a .sql file that I want to read into my Python session (Python 3.9). I'm opening it using the file context manager.
with open('file.sql', 'r') as f:
    text = f.read()
When I print the text, I still get the binary characters, i.e., \xff\xfe\r\x00\n\x00-\x00-..., etc.
I've tried all the arguments such as 'rb', encoding='utf-8', etc., but the result is still binary text. It should be noted that I've used this very same procedure many times before in my code and it has not been a problem.
Did something change in python 3.9?
The first two bytes \xff\xfe look like a BOM (Byte Order Mark),
and the table on the Wikipedia BOM page shows that \xff\xfe can mean the encoding is UTF-16-LE.
So you could try
with open('file.sql', 'r', encoding='utf-16-le') as f:
EDIT:
There is module chardet which you may also try to use to detect encoding.
import chardet

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

info = chardet.detect(data)
print(info['encoding'])

text = data.decode(info['encoding'])
Usually files don't have a BOM, but if they do, you can try to detect it using the example from unicodebook.readthedocs.io/guess_encoding/check-for-bom-markers
from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE

BOMS = (
    (BOM_UTF8, "UTF-8"),
    (BOM_UTF32_BE, "UTF-32-BE"),
    (BOM_UTF32_LE, "UTF-32-LE"),
    (BOM_UTF16_BE, "UTF-16-BE"),
    (BOM_UTF16_LE, "UTF-16-LE"),
)

def check_bom(data):
    return [encoding for bom, encoding in BOMS if data.startswith(bom)]

# ---------

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

encoding = check_bom(data)
print(encoding)

if encoding:
    text = data.decode(encoding[0])
else:
    print('unknown encoding')

How can I convert JSON-encoded data that contains Unicode surrogate pairs to string?

So I am trying to take this data, which uses Unicode escape sequences, and make it print with emojis. It is currently in a .txt file, but I will write to an Excel file later. Anyway, I am getting an error I am not sure what to do with. This is the text I am reading:
"Thanks #UglyGod \ud83d\ude4f https:\\/\\/t.co\\/8zVVNtv1o6\"
"RT #Rosssen: Multiculti beatdown \ud83d\ude4f https:\\/\\/t.co\\/fhwVkjhFFC\"
And here is my code:
sampleFile = open('tweets.txt', 'r').read()
splitFile = sampleFile.split('\n')
for line in sampleFile:
    x = line.encode('utf-8')
    print(x.decode('unicode-escape'))
This is the error Message:
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
Any ideas?
This is how the data was originally generated.
class listener(StreamListener):
    def on_data(self, data):
        # Check for a field unique to tweets (if missing, return immediately)
        if "in_reply_to_status_id" not in data:
            return
        with open("see_no_evil_monkey.csv", 'a') as saveFile:
            try:
                saveFile.write(json.dumps(data) + "\n")
            except (BaseException, e):
                print ("failed on data", str(e))
                time.sleep(5)
        return True

    def on_error(self, status):
        print (status)
This is how the data was originally generated... saveFile.write(json.dumps(data) + "\n")
You should use json.loads() instead of .decode('unicode-escape') to read JSON text:
#!/usr/bin/env python3
import json

with open('tweets.txt', encoding='ascii') as file:
    for line in file:
        text = json.loads(line)
        print(text)
Your emoji 🙏 is represented as a surrogate pair; see here for info about this particular glyph. Python cannot decode surrogates, so you'll need to look at exactly how your tweets.txt file was generated and try encoding the original tweets, along with the emoji, as UTF-8. This will make reading and processing the text file much easier.
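For example, here is a minimal sketch (Python 3, assuming the stream hands you the raw JSON string; the function name and output file name are made up) of writing the tweets as real UTF-8 instead of ASCII escape sequences:
#!/usr/bin/env python3
import json

def save_raw_tweet(raw_data):
    # Sketch: parse the JSON the stream delivered, then re-serialize it without
    # ASCII escaping, so the emoji is stored as UTF-8 rather than surrogate escapes.
    tweet = json.loads(raw_data)
    with open('tweets_utf8.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(tweet, ensure_ascii=False) + '\n')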

How can I replace Unicode characters with Turkish characters in a text file with Python

I am working with Twitter. I got data from the Twitter Stream API, and the result is a JSON file. I wrote the tweet data to a text file, and now I see Unicode escape sequences instead of Turkish characters. I don't want to do find/replace in Notepad++ by hand. Is there any automatic way, using Python, to open the txt file, read all the data, and replace the Unicode escapes with Turkish characters?
Here are the Turkish characters and the Unicode escapes I want to replace:
ğ - \u011f
Ğ - \u011e
ı - \u0131
İ - \u0130
ö - \u00f6
Ö - \u00d6
ü - \u00fc
Ü - \u00dc
ş - \u015f
Ş - \u015e
ç - \u00e7
Ç - \u00c7
I tried two different approaches:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

dosya = open('veri.txt', 'r')
for line in dosya:
    match = re.search(line, "\u011f")
    if (match):
        replace("\u011f", "ğ")
dosya.close()
and:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

f1 = open('veri.txt', 'r')
f2 = open('veri2.txt', 'w')
for line in f1:
    f2.write=(line.replace('\u011f', 'ğ'))
    f2.write=(line.replace('\u011e', 'Ğ'))
    f2.write=(line.replace('\u0131', 'ı'))
    f2.write=(line.replace('\u0130', 'İ'))
    f2.write=(line.replace('\u00f6', 'ö'))
    f2.write=(line.replace('\u00d6', 'Ö'))
    f2.write=(line.replace('\u00fc', 'ü'))
    f2.write=(line.replace('\u00dc', 'Ü'))
    f2.write=(line.replace('\u015f', 'ş'))
    f2.write=(line.replace('\u015e', 'Ş'))
    f2.write=(line.replace('\u00e7', 'ç'))
    f2.write=(line.replace('\u00c7', 'Ç'))
f1.close()
f2.close()
Neither of these worked.
How can I make it work?
JSON allows both "escaped" and "unescaped" characters. The reason why the Twitter API returns only escaped characters is that it can use the ASCII encoding, which increases interoperability. For Turkish characters you need another encoding. Opening a file with the open function opens a file assuming your current locale encoding, which is probably what your editor expects. If you want the output file to have e.g. the ISO-8859-9 encoding, you can pass encoding='ISO-8859-9' as an additional parameter to the open function.
You can read a file containing a JSON object with the json.load function. This returns a Python object with the escaped characters decoded. Writing it again with json.dump and passing ensure_ascii=False as an argument writes the object back to a file without encoding Turkish characters as escape sequences. An example:
import json
inp = open('input.txt', 'r')
out = open('output.txt', 'w')
in_as_obj = json.load(inp)
json.dump(in_as_obj, out, ensure_ascii=False)
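If you need to control the encodings explicitly (Python 3; the output encoding below is just an example, as noted above), a minimal variant of the same idea could be:
import json

# Sketch: read the escaped JSON, write it back unescaped in an explicit encoding.
with open('input.txt', 'r', encoding='utf-8') as inp, \
        open('output.txt', 'w', encoding='ISO-8859-9') as out:
    json.dump(json.load(inp), out, ensure_ascii=False)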
Your file isn't really a JSON file, but instead a file containing multiple JSON objects. If each JSON object is on its own line, you can try the following:
import json

inp = open('input.txt', 'r')
out = open('output.txt', 'w')
for line in inp:
    if not line.strip():
        out.write(line)
        continue
    in_as_obj = json.loads(line)
    json.dump(in_as_obj, out, ensure_ascii=False)
    out.write('\n')
But in your case it's probably better to write unescaped JSON to the file in the first place. Try replacing your on_data method by (untested):
def on_data(self, raw_data):
    data = json.loads(raw_data)
    print(json.dumps(data, ensure_ascii=False))
You can use this method:
# For Turkish Character
translationTable = str.maketrans("ğĞıİöÖüÜşŞçÇ", "gGiIoOuUsScC")
yourText = "Pijamalı Hasta Yağız Şoföre Çabucak Güvendi"
yourText = yourText.translate(translationTable)
print(yourText)

Which encoding to use while reading Excel using xlrd

I am trying to read an Excel file using xlrd to write into txt files. Everything is being written fine except for some rows which have Spanish characters like 'Téd'. I can encode those using latin-1 encoding. However, the code then fails for other rows which have an en dash '–' (u'\u2013'); u'\u2013' can't be encoded using latin-1. When using UTF-8, the '–' characters are written out fine, but 'Téd' is written as 'TÃ©d', which is not acceptable. How do I correct this?
Code below :
#!/usr/bin/python
import xlrd
import csv
import sys

filePath = sys.argv[1]

with xlrd.open_workbook(filePath) as wb:
    shNames = wb.sheet_names()
    for shName in shNames:
        sh = wb.sheet_by_name(shName)
        csvFile = shName + ".csv"
        with open(csvFile, 'wb') as f:
            c = csv.writer(f)
            for row in range(sh.nrows):
                sh_row = []
                cell = ''
                for item in sh.row_values(row):
                    if isinstance(item, float):
                        cell = item
                    else:
                        cell = item.encode('utf-8')
                    sh_row.append(cell)
                    cell = ''
                c.writerow(sh_row)
        print shName + ".csv File Created"
Python's csv module doesn't support Unicode input.
You are correctly encoding your input before writing it -- so you don't need codecs. Just open(csvFile, "wb") (the b is important) and pass that object to the writer:
with open(csvFile, "wb") as f:
    writer = csv.writer(f)
    writer.writerow([entry.encode("utf-8") for entry in row])
Alternatively, unicodecsv is a drop-in replacement for csv that handles encoding.
You are getting Ã© instead of é because you are mistaking UTF-8 encoded text for latin-1. This is probably because you're encoding twice, once with .encode("utf-8") and once with codecs.open.
By the way, the right way to check the type of an xlrd cell is to do cell.ctype == xlrd.ONE_OF_THE_TYPES.
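For instance, a minimal sketch of that type check (xlrd exposes constants such as xlrd.XL_CELL_TEXT and xlrd.XL_CELL_NUMBER; the helper name is made up and the row handling mirrors the question's code):
import xlrd

def row_to_strings(sh, row):
    # Sketch: branch on the cell type instead of isinstance(item, float)
    values = []
    for col in range(sh.ncols):
        cell = sh.cell(row, col)
        if cell.ctype == xlrd.XL_CELL_NUMBER:
            values.append(str(cell.value))
        elif cell.ctype == xlrd.XL_CELL_TEXT:
            values.append(cell.value.encode('utf-8'))
        else:
            values.append('')
    return values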

Python: how to convert from Windows 1251 to Unicode?

I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.
#!/usr/bin/env python
import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think the file may be
    # encoded in inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)
    # now start iterating over our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then continue
            # with the second encoding from the tuple,
            # and so on... until one succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue
    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)
    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()
How can I do that?
Thank you
import codecs
f = codecs.open(filename, 'r', 'cp1251')
u = f.read() # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u) # and now the contents have been output as UTF-8
Is this what you intend to do?
This is just a guess, since you didn't specify what you mean by "doesn't work".
If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).
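For example, a minimal sketch of prepending the UTF-8 BOM when writing the converted file (the file names here are just placeholders):
import codecs

# Sketch: decode from cp1251, then write UTF-8 with an explicit BOM so that
# BOM-aware viewers (e.g. Notepad, Excel) recognize the file as UTF-8.
data = open('input.txt', 'rb').read().decode('cp1251')
with open('output.txt', 'wb') as out:
    out.write(codecs.BOM_UTF8)
    out.write(data.encode('utf-8'))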
If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".
