Cannot write German characters scraped from XPath to CSV file - python

I am trying to write information containing German umlaut characters into a CSV. When I write only the first parameter, "name", it comes out correctly. If I write "name" and "institution" though, I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0308' in position 71: character maps to <undefined>
As you can see in the code below, I tried encoding and decoding the text using different combinations of encodings.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

# this is the header of the csv
with open('/filepath/result.csv', 'w', encoding='utf-8') as f:
    f.write("name, institution, \n")

l = list(range(1148, 1153))
for i in l:
    url = 'webaddress.com' + str(i)
    driver.get(url)
    name = driver.find_elements_by_xpath('//div[@style="width:600px; display:inline-block;"]')[0].text
    name = '\"' + name + '\"'
    institution = driver.find_elements_by_xpath('//div[@style="width:600px; display:inline-block;"]')[1].text
    institution = '\"' + institution + '\"'
    print(str(i) + ': ' + name, '\n', str(i) + ': ' + institution, '\n')
    print(institution.encode('utf-8'))
    print(institution.encode('utf-8').decode('utf-8'))
    print(institution.encode('utf-8').decode('ISO-8859-15'))
    with open('/filepath/result.csv', 'a', encoding='utf-8') as f:
        f.write(name + ',' + institution + '\n')
driver.close()
The results that show up in the CSV when I set all encodings to UTF-8 look like the ones where I encode UTF-8 and decode ISO-8859-15. I got the same error as above when I encoded latin1 and decoded utf-8.
Thank you for your help.

You seem to be confused about the purpose of encode. Why would you print(institution.encode('utf-8').decode('utf-8'))? That is simply equivalent to print(institution)!
I'm guessing your traceback comes from one of the prints rather than from the write(). Try taking out the offending one(s), or figure out how to print Unicode to your console and then just do that.
You should probably read Ned Batchelder's Pragmatic Unicode.
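If the traceback does come from print() (the 'charmap' codec is typically the Windows console encoding), a minimal sketch for forcing UTF-8 console output on Python 3.7+ looks like this; the reconfigure() call is the only line you should need:

import sys

# Force stdout to UTF-8 so print() can emit characters such as the
# combining diaeresis '\u0308' (reconfigure() exists on Python 3.7+ only).
sys.stdout.reconfigure(encoding='utf-8')

print('Universita\u0308t')  # prints "Universität"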

Add this line at the top of your foo.py file:
# -*- coding: UTF-8 -*-
As an alternative you can use the io module as follows:
import io
# this is the header of the csv
with io.open('/filepath/result.csv', 'w', encoding='utf-8') as f:
    f.write("name, institution, \n")
and later:
with io.open('/filepath/result.csv', 'a', encoding='utf-8') as f:
    f.write(name + ',' + institution + '\n')
Note that you must not .encode() the string here: the file is opened in text mode with encoding='utf-8', so it expects str, and the encoding is handled for you.
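As a side note, rather than wrapping fields in quotes by hand, the standard csv module handles quoting for you; a minimal sketch with placeholder values standing in for the scraped text:

import csv

name = 'Müller'                   # placeholder for the scraped name
institution = 'Universität Wien'  # placeholder for the scraped institution

# csv.writer quotes fields containing commas or quotes itself, so the
# manual '"' wrapping from the question becomes unnecessary.
with open('/filepath/result.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow([name, institution])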

Related

Error when trying to write unicode results from web scraping to CSV

I'm using a web scraping script (found on GitHub) and writing the results to a .csv file. Some of the results (user reviews) are written in Japanese or Russian, so I would like to write Unicode to the .csv file.
The code works fine when I just use the csv module, but that doesn't write Unicode to the CSV.
This is part of the code I'm using for the web scraping:
import csv
import time

import requests
from lxml import html

# datafile, pages, throttle, sleepTime, reviewPage and ratingCount are defined earlier
with open(datafile, 'w', newline='', encoding='utf8') as csvfile:
    # Tab delimited to allow for special characters
    datawriter = csv.writer(csvfile, delimiter=',')
    print('Processing..')
    for i in range(1, pages + 1):
        # Sleep if throttle enabled
        if throttle: time.sleep(sleepTime)
        page = requests.get(reviewPage + '&page=' + str(i))
        tree = html.fromstring(page.content)
        # Each item below scrapes a page's review titles, bodies, ratings and languages.
        titles = tree.xpath('//a[@class="review-title-link"]')
        bodies = tree.xpath('//div[@class="review-body"]')
        ratings = tree.xpath('//div[@data-status]')
        langs = tree.xpath("//h3[starts-with(@class, 'review-title')]")
        dates = tree.xpath("//time[@datetime]")
        for idx, e in enumerate(bodies):
            # Title of comment
            title = titles[idx].text_content()
            # Body of comment
            body = e.text_content().strip()
            # The rating is the 5th from last element
            rating = ratings[idx].get('data-status').split(' ')[-5]
            # Language is 2nd element of h3 tag
            lang = langs[idx].get('class').split(' ')[1]
            # Date
            date = dates[idx].get("datetime").split('T')[0]
            datawriter.writerow([title, body, rating, lang, date])
    print('Processed ' + str(ratingCount) + '/' + str(ratingCount) + ' ratings.. Finished!')
I've tried to import unicodecsv as csv but this raised a TypeError:
TypeError Traceback (most recent call last)
<ipython-input-4-2db937260285> in <module>()
44 date = dates[idx].get("datetime").split('T')[0]
45
---> 46 datawriter.writerow([title,body,rating,lang,date])
47 print('Processed ' + str(ratingCount) + '/' + str(ratingCount) + ' ratings.. Finished!')
~\lib\site-packages\unicodecsv\py3.py in writerow(self, row)
26
27 def writerow(self, row):
---> 28 return self.writer.writerow(row)
29
30 def writerows(self, rows):
C:\Users\Ebel\Anaconda3\lib\site-packages\unicodecsv\py3.py in write(self, string)
13
14 def write(self, string):
---> 15 return self.binary.write(string.encode(self.encoding, self.errors))
16
17
TypeError: write() argument must be str, not bytes
I would like to have a solution for this problem. Thanks in advance!
As unicodecsv writes bytes, not strings, you want to open() your file in binary mode. Note that binary mode takes neither an encoding nor a newline parameter, so you have to remove both.
with open(datafile, 'w', newline='', encoding='utf8') as csvfile:
then becomes:
with open(datafile, 'wb') as csvfile:
The b in 'wb' means you want to write bytes, not strings.
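Putting it together, a minimal sketch of the unicodecsv variant; 'reviews.csv' is a placeholder for the datafile path from the question:

import unicodecsv

# unicodecsv encodes each row to bytes itself, so the file is opened
# in binary mode with no encoding or newline arguments.
with open('reviews.csv', 'wb') as csvfile:
    datawriter = unicodecsv.writer(csvfile, encoding='utf-8')
    datawriter.writerow(['title', 'body', 'rating', 'lang', 'date'])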
Making my comment an answer:
Your with statement is correct for Python 3, and unicodecsv is only needed for Python 2. Just import csv (the built-in module). On Windows, use encoding='utf-8-sig': Windows Notepad won't display a UTF-8 file correctly without a BOM signature, and Excel won't read it correctly either.
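A minimal sketch of that advice, using the built-in csv module and a couple of placeholder rows:

import csv

rows = [['title', 'body', 'rating', 'lang', 'date'],
        ['Great', 'とても良いレビュー', '5', 'ja', '2018-01-01']]  # placeholder data

# 'utf-8-sig' prepends a BOM so Notepad and Excel detect UTF-8.
with open('reviews.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
    datawriter = csv.writer(csvfile)
    datawriter.writerows(rows)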

Python CSV write to file unreadable in Excel (Chinese characters)

I am trying to perform text analysis on Chinese texts. The program is provided below. I get results with unreadable characters such as 浜烘皯鏃ユ姤绀捐, yet if I change the output file result.csv to result.txt, the characters come out correctly as 人民日报社论. What's wrong here? I cannot figure it out. I tried several ways, including adding decoders and encoders.
# -*- coding: utf-8 -*-
import os
import glob
import jieba
import jieba.analyse
import csv
import codecs

segList = []
raw_data_path = 'monthly_raw_data/'
file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]
jieba.load_userdict("customized_dict.txt")
for name in file_name:
    all_text = ""
    multi_line_text = ""
    with open(raw_data_path + name + ".txt", "r") as file:
        for line in file:
            if line != '\n':
                multi_line_text += line
    templist = multi_line_text.split('\n')
    for text in templist:
        all_text += text
    seg_list = jieba.cut(all_text, cut_all=False)
    temp_text = []
    for item in seg_list:
        temp_text.append(item.encode('utf-8'))
    stop_list = []
    with open("stopwords.txt", "r") as stoplistfile:
        for item in stoplistfile:
            stop_list.append(item.rstrip('\r\n'))
    text_without_stopwords = []
    for word in temp_text:
        if word not in stop_list:
            text_without_stopwords.append(word)
    segList.append(text_without_stopwords)
with open("results/result.csv", 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(segList)
For UTF-8 encoding, Excel requires a BOM (byte order mark) codepoint written at the start of the file or it will assume an ANSI encoding, which is locale-dependent. U+FEFF is the Unicode BOM. Here's an example that will open in Excel correctly:
#!python2
#coding:utf8
import csv

data = [[u'American', u'美国人'],
        [u'Chinese', u'中国人']]

with open('results.csv', 'wb') as f:
    f.write(u'\ufeff'.encode('utf8'))
    w = csv.writer(f)
    for row in data:
        w.writerow([item.encode('utf8') for item in row])
Python 3 makes this easier. Use the 'w', newline='', encoding='utf-8-sig' parameters instead of 'wb'; the writer will then accept Unicode strings directly and a BOM is written automatically:
#!python3
#coding:utf8
import csv

data = [['American', '美国人'],
        ['Chinese', '中国人']]

with open('results.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f)
    w.writerows(data)
There is also a 3rd-party unicodecsv module that makes Python 2 easier to use as well:
#!python2
#coding:utf8
import unicodecsv

data = [[u'American', u'美国人'],
        [u'Chinese', u'中国人']]

with open('results.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='utf-8-sig')
    w.writerows(data)
Here is another way, kind of tricky:

#!python2
#coding:utf8
import csv

data = [[u'American', u'美国人'],
        [u'Chinese', u'中国人']]

with open('results.csv', 'wb') as f:
    f.write(u'\ufeff'.encode('utf8'))
    w = csv.writer(f)
    for row in data:
        w.writerow([item.encode('utf8') for item in row])

This code block generates a CSV file encoded as UTF-8. Then:
1. Open the file with Notepad++ (or another editor with an encoding feature).
2. Encoding -> Convert to ANSI.
3. Save.
Open the file with Excel; it's OK.
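The Notepad++ conversion step can also be done in code; a minimal sketch, assuming the target "ANSI" codepage is GBK (the usual mapping on a Chinese-locale Windows):

# Re-encode the UTF-8 CSV to the local ANSI codepage so Excel opens it
# directly. 'gbk' is an assumption for a Chinese-locale Windows; use
# whatever codepage ANSI maps to on your machine.
with open('results.csv', 'r', encoding='utf-8-sig') as src:
    text = src.read()
with open('results_ansi.csv', 'w', encoding='gbk', newline='') as dst:
    dst.write(text)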

pycharm console unicode to readable string

I am studying Python with this tutorial.
The problem is that when I try to get Cyrillic characters, I get Unicode escapes in the PyCharm console.
import requests
from bs4 import BeautifulSoup
import operator
import codecs


def start(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')
    for post_text in soup.findAll('a', {'class': 'b-tasks__item__title js-set-visited'}):
        content = post_text.string
        words = content.lower().split()
        for each_word in words:
            word_list.append(each_word)
    clean_up_list(word_list)


def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "!@#$%^&*()_+{}|:<>?,./;'[]\\=-\""
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    create_dictionary(clean_word_list)


def create_dictionary(clean_word_list):
    word_count = {}
    for word in clean_word_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    for key, value in sorted(word_count.items(), key=operator.itemgetter(1)):
        print(key, value)


start('https://youdo.com/tasks-all-opened-all-moscow-1')

When I change print(key, value) to print(key.decode('utf8'), value) I get "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)".
There are some suggestions on the internet about changing the encoding in some files, which I don't really get. Can't I just read it in the console?
I am on OSX.
UPD: I also tried key.encode("utf-8").
UTF-8 is sometimes painful. I created a file with one line in Latin characters and another in Russian ones. The following code:

# encoding: utf-8
with open("testing.txt", "r", encoding='utf-8') as f:
    line = f.read()
print(line)

outputs both lines correctly in PyCharm. Note the two encoding entries: the coding declaration at the top of the source file and the encoding argument to open().
Since you are getting the data from a web page, you must make sure that you use the right encoding as well. The following code:

# encoding: utf-8
import requests

r = requests.get('http://www.pravda.ru/')
r.encoding = 'utf-8'
print(r.text)

prints the page's Cyrillic text correctly in PyCharm.
Please note that you must explicitly set the encoding to match that of the page.
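If you don't want to hard-code the page's encoding, requests can guess it from the response body; a minimal sketch:

import requests

r = requests.get('http://www.pravda.ru/')
# apparent_encoding is requests' charset guess based on the body,
# often better than the HTTP-header default of ISO-8859-1.
r.encoding = r.apparent_encoding
print(r.text)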

Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte

I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); here is the full json-to-csv.py file.
The conversion works if I set

#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8', 'ignore')

and

fp = open(json_file_path, 'r', encoding='utf-8')

but when I import the CSV into MS Excel I see bad Cyrillic characters, for example \xe0\xf1; English text is OK.
I experimented with setting encode('cp1251','ignore'), but then I got an error:

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
import sys
import json
import csv

##
# This function converts an item like
# {
#   "item_1":"value_11",
#   "item_2":"value_12",
#   "item_3":"value_13",
#   "item_4":["sub_value_14", "sub_value_15"],
#   "item_5":{
#       "sub_item_1":"sub_item_value_11",
#       "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
#   }
# }
# To
# {
#   "node_item_1":"value_11",
#   "node_item_2":"value_12",
#   "node_item_3":"value_13",
#   "node_item_4_0":"sub_value_14",
#   "node_item_4_1":"sub_value_15",
#   "node_item_5_sub_item_1":"sub_item_value_11",
#   "node_item_5_sub_item_2_0":"sub_item_value_12",
#   "node_item_5_sub_item_2_1":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item

    #Reduction Condition 1
    if type(value) is list:
        i = 0
        for sub_item in value:
            reduce_item(key + '_' + str(i), sub_item)
            i = i + 1
    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key + '_' + str(sub_key), value[sub_key])
    #Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251', 'ignore')


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        #Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]

        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)

        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

        with open(csv_file_path, 'wt+') as f:  # wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)

        print("Just completed writing csv file with %d columns" % len(header))
How can I convert Cyrillic correctly and also skip bad characters?
You need to know the Cyrillic encoding of the file you are going to open.
For example, this is enough in Python 3:

with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
structure = json.loads(data)

In Python 3 the data variable is then a decoded Unicode str. In Python 2 there might be problems feeding the input to json.
Also try printing the line in the Python interpreter and see if the symbols come out right. Without the input file it is hard to tell if everything is right. Are you sure it is a Python problem and not an Excel-related one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?
The most important thing when working with encodings is checking that the input and the output are right. I would suggest looking here.
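As an illustration of that last point, a minimal round-trip check; the file name and sample string are just placeholders:

# Write a known Cyrillic string and read it back with the same encoding;
# if the round trip mangles it, the encoding pair is wrong.
sample = 'проверка'  # placeholder test string
with open('check.txt', 'w', encoding='utf-8') as f:
    f.write(sample)
with open('check.txt', 'r', encoding='utf-8') as f:
    assert f.read() == sample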
Maybe you could use chardet to detect the file's encoding:

import json

import chardet

File = 'arq.GeoJson'
enc = chardet.detect(open(File, 'rb').read())['encoding']
with open(File, 'r', encoding=enc) as f:
    data = json.load(f)

(There is no need to call f.close(); the with statement closes the file.) This avoids having to guess the encoding.

UnicodeDecodeError with Django's request.FILES

I have the following code in the view:

def view(request):
    body = u""
    for filename, f in request.FILES.items():
        body = body + 'Filename: ' + filename + '\n' + f.read() + '\n'
On some cases I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 7470: ordinal not in range(128)
What am I doing wrong? (I am using Django 1.1.)
Thank you.
Django has some utilities that handle this (smart_unicode, force_unicode, smart_str). Generally you just need smart_unicode:

from django.utils.encoding import smart_unicode

def view(request):
    body = u""
    for filename, f in request.FILES.items():
        body = body + 'Filename: ' + filename + '\n' + smart_unicode(f.read()) + '\n'
You are appending f.read() directly to a unicode string without decoding it. If the data you are reading from the file is UTF-8 encoded, use utf-8; otherwise use whatever encoding it is actually in.
Decode it first and then append it to body, e.g.:

data = f.read().decode("utf-8")
body = body + 'Filename: ' + filename + '\n' + data + '\n'
Anurag's answer is correct. However, another problem here is that you can't know for certain the encoding of the files that users upload. It may be useful to loop over a tuple of the most common ones until one works (note that the bytes must be read once, before the loop, since a second f.read() would return nothing):

raw = f.read()
encodings = ('windows-xxx', 'iso-yyy', 'utf-8',)
for e in encodings:
    try:
        data = raw.decode(e)
        break
    except UnicodeDecodeError:
        pass
If you are not in control of the encoding of the files that can be uploaded, you can guess what encoding a file is in using the Universal Encoding Detector module chardet.
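A minimal sketch of that approach against an uploaded file; the 'latin-1' fallback and the helper name read_uploaded are assumptions for illustration:

import chardet

def read_uploaded(f):
    # Detect the most likely encoding of the raw bytes, then decode.
    raw = f.read()
    encoding = chardet.detect(raw)['encoding'] or 'latin-1'  # fallback is an assumption
    return raw.decode(encoding, errors='replace')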
