UnicodeEncodeError in Python when uploading CSV file to Postgres DB

I keep getting a UnicodeEncodeError when trying to upload a CSV file into a Postgres DB using Python 2.7.
First I create the file in CSV format. The file contains non-Latin characters, which is why I encode the second column (the one that holds strings) when writing it:
writer = csv.writer(response, dialect='excel')
writer.writerow(tuple(corresponding_data[btn]["columns"].split(',')))
for row in rows:
    field_1 = row[0]
    field_2 = row[1].encode(encoding='UTF-8')
    fields = [field_1, field_2]
    writer.writerows([fields])
The file is created without errors. When I open it in Excel I see some values like: Dajï¿ï¿
In order to upload the file and save it into a table in Postgres, I use the Python module csvkit.
This is what I do:
import codecs

f = codecs.open(absolute_base_file, 'rb', encoding='utf-8')
delimiter = ","
no_header_row = False
try:
    csv_table = table.Table.from_csv(f, name=table_name_temp, no_header_row=no_header_row, delimiter=delimiter)
Although I specify the encoding, I keep getting an error:
<type 'exceptions.UnicodeEncodeError'>
I don't know what else to try here.
EDITED
After checking the values in the DB, I see they don't actually contain any non-Latin characters, but there are values with whitespace, and the whitespace seems to get converted to Unicode when I save them.
I think this is what is causing the issue.

You can try using the unicodecsv module instead of the built-in csv module.
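For example, a rough sketch of what that could look like with the writer code above (assuming the same response, rows and corresponding_data objects from the question; unicodecsv encodes unicode values to UTF-8 itself, so the manual .encode() call goes away):
import unicodecsv as csv

writer = csv.writer(response, dialect='excel', encoding='utf-8')
writer.writerow(tuple(corresponding_data[btn]["columns"].split(',')))
for row in rows:
    writer.writerow([row[0], row[1]])  # no manual .encode() needed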

In the end I flattened the values before writing them to the CSV. I used the unidecode module as follows:
from unidecode import unidecode

for row in rows:
    field_1 = row[0]
    field_2 = unidecode(row[1]).encode(encoding='UTF-8')  # LINE CHANGED
    fields = [field_1, field_2]
    writer.writerows([fields])
return response
Although not a permanent solution, this solved my issue for now.

Related

Python can't read CSV file downloaded from Azure DevOps (UTF-8)

I created an Azure DevOps query and chose 'download results as csv', which gave me a CSV file. If I open this CSV in VS Code, I can see in the bottom right corner that it says UTF-8 with BOM.
I am trying to write a Python function that will read in each value of this CSV file. I cannot rely on parsing the text myself and splitting values on the comma character, because I will have values that contain commas inside them.
If I open my CSV in Excel, everything is organized perfectly. But if I try to parse the file in Python, it reads every row as a single string separated by commas (bad).
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                print('row=',row)
            print('done')
            return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
console output:
filename: csv_migration\ads-test-direct-download.csv, id_format: ID, encoding: utf-8-sig
row= ['Title,State,Work Item Type,ID,12NC']
row= ['TITLE,WITH COMMAS,To Do,NAME,6034,"value,with,commas"']
done
How can I read this file in Python so it separates each value into a list, instead of this single string?
I get the same result with encodingVar='utf-8'. Should I open my CSV in some app like Notepad++ and convert it to UTF-16? My code works great for .csv files with UTF-16 encoding; it can parse each individual value into a list no problem. Why won't this work with a UTF-8 BOM CSV, even when Excel can parse the individual values perfectly fine?
csv file: https://file.io/TXh6uyXKZaug
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                row_as_list = row[0].split(",")  # <-- row is a one-element list here; split its string to get the values
                print('row=',row_as_list)
            print('done')
            return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
This snippet splits the line into a list that you can index to get the information out
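Note that splitting on "," by hand will still break values like "value,with,commas" from the sample output above. If the download is really comma-delimited (my assumption, based on that sample), a sketch that lets csv.reader do the splitting keeps quoted fields intact:
from csv import reader

with open("csv_migration\\ads-test-direct-download.csv", "r", encoding="utf-8-sig") as read_obj:
    for row in reader(read_obj, delimiter=','):  # default quotechar='"' keeps embedded commas inside one field
        print(row)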

Reading a text file of dictionaries stored in one line

Question
I have a text file that records metadata of research papers requested with the SemanticScholar API. However, when I wrote the requested data, I forgot to add "\n" after each individual record. This results in something that looks like
{<metadata1>}{<metadata2>}{<metadata3>}...
whereas it should look like this if I had added "\n":
{<metadata1>}
{<metadata2>}
{<metadata3>}
...
Now I would like to read the data back. As all the metadata is stored on one line, I need to do some hacks:
First I split the concatenated dicts on "{".
Then I try to convert each string back to a dict. Note that I am aware each piece might not be in proper JSON format.
import json

with open("metadata.json", "r") as f:
    for line in f.readline().split("{"):
        print(json.loads("{" + line.replace("\'", "\"")))
However, there is still an error message:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
I am wondering what I should do to recover all the metadata I collected?
MWE
Note: to generate the metadata.json file I use, run the following code; it should work out of the box.
import json
import urllib
import requests

baseURL = "https://api.semanticscholar.org/v1/paper/"
paperIDList = ["200794f9b353c1fe3b45c6b57e8ad954944b1e69",
               "b407a81019650fe8b0acf7e4f8f18451f9c803d5",
               "ff118a6a74d1e522f147a9aaf0df5877fd66e377"]

for paperID in paperIDList:
    response = requests.get(urllib.parse.urljoin(baseURL, paperID))
    metadata = response.json()
    record = dict()
    record["title"] = metadata["title"]
    record["abstract"] = metadata["abstract"]
    record["paperId"] = metadata["paperId"]
    record["year"] = metadata["year"]
    record["citations"] = [item["paperId"] for item in metadata["citations"] if item["paperId"]]
    record["references"] = [item["paperId"] for item in metadata["references"] if item["paperId"]]
    with open("metadata.json", "a") as fileObject:
        fileObject.write(json.dumps(record))
The problem is that when you do the split("{") you get a first item that is empty, corresponding to the opening {. Just ignore the first element and everything works fine (I also added an r prefix to your quote replacements so Python treats them as raw string literals and does the replacement properly):
with open("metadata.json", "r") as f:
for line in f.readline().split("{")[1:]:
print(json.loads("{" + line).replace(r"\'", r"\""))
As suggested in the comments, I would actually recommend recreating the file or saving a new version where you replace }{ by }\n{:
with open("metadata.json", "r") as f:
data = f.read()
data_lines = data.replace("}{","}\n{")
with open("metadata_mod.json", "w") as f:
f.write(data_lines)
That way you will have one paper's metadata per line, as you want.
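Once the file has one record per line, reading the metadata back should be straightforward; a small sketch (assuming the rewritten metadata_mod.json from above):
import json

with open("metadata_mod.json", "r") as f:
    records = [json.loads(line) for line in f if line.strip()]
print(len(records), "records loaded")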

Ignore encoding errors with csv writer and MySQLdb

I've been trying to export the full data from a MySQL table to CSV with Python, and it worked fine until the table gained a "description" column; now it's a nightmare because there are encoding issues everywhere.
After trying tons of things read in other posts, I give up on those characters: I just want to skip them and avoid the errors.
test.py:
import MySQLdb, csv, codecs

dbConn = MySQLdb.connect(dbServer, dbUser, dbPass, dbName, charset='utf8')
cur = dbConn.cursor()

def createFile(self):
    SQLview = 'SELECT fruit_type, fruit_qty, fruit_price, fruit_description FROM fruits'
    cur.execute(SQLview)
    with codecs.open('fruits.csv', 'wb', encoding='utf8', errors='ignore') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerows(cur)
Still getting that error from the function:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 27: ordinal not in range(128)
Any idea if I can skip that error and still write the rest of the data from the DB query?
PS: The line where the function crashes is:
csv_writer.writerows(cur)
Don't know if that's useful info for someone.
Finally solved. I changed:
import csv
to:
import unicodecsv as csv
and changed:
csv_writer = csv.writer(csv_file)
to:
csv_writer = csv.writer(csv_file, encoding='utf-8')
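For reference, a minimal sketch of how the pieces fit together after those changes (assuming the same connection variables and example fruits table from above; unicodecsv writes encoded bytes itself, so a plain binary file object is enough and codecs.open is no longer needed):
import MySQLdb
import unicodecsv as csv

dbConn = MySQLdb.connect(dbServer, dbUser, dbPass, dbName, charset='utf8')
cur = dbConn.cursor()
cur.execute('SELECT fruit_type, fruit_qty, fruit_price, fruit_description FROM fruits')

with open('fruits.csv', 'wb') as csv_file:
    csv_writer = csv.writer(csv_file, encoding='utf-8')  # encodes unicode values on write
    csv_writer.writerows(cur)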

Nested JSON to CSV using Python script

I'm new to Python and I've got a large JSON file that I need to convert to CSV - below is a sample:
{ "status": "success","Name": "Theresa May","Location": "87654321","AccountCategory": "Business","AccountType": "Current","TicketNo": "12345-12","AvailableBal": "12775.0400","BookBa": "123475.0400","TotalCredit": "1234567","TotalDebit": "0","Usage": "5","Period": "May 11 2014 to Jul 11 2014","Currency": "GBP","Applicants": "Angel","Signatories": [{"Name": "Not Available","BVB":"Not Available"}],"Details": [{"PTransactionDate":"24-Jul-14","PValueDate":"24-Jul-13","PNarration":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"24-Jul-14","PValueDate":"23-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"25-Jul-14","PValueDate":"22-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"25-Jul-14","PValueDate":"21-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"},{"PTransactionDate":"25-Jul-14","PValueDate":"20-Jul-14","PTest":"Cash Deposit","PCredit":"0.0000","PDebit":"40003.0000","PBalance":"40003.0000"}]}
I need this to show up as
name, status, location, accountcategory, accounttype, availablebal, totalcredit, totaldebit, etc. as columns,
with pcredit, pdebit, pbalance, ptransactiondate, pvaluedate and 'ptest' taking new values on each row, as the JSON file shows.
I've managed to put the script below together from looking online, but it gives me an empty CSV file at the end. What have I done wrong? I have used online JSON-to-CSV converters and they work; however, as these are sensitive files, I'm hoping to write/manage this with my own script so I can see exactly how it works. Please see below for my Python script - can I have some advice on what to change? Thanks.
import csv
import json

infile = open("BankStatementJSON1.json", "r")
outfile = open("testing.csv", "w")
writer = csv.writer(outfile)
for row in json.loads(infile.read()):
    writer.writerow(row)

import csv, json, sys

# if you are not using utf-8 files, remove the next line
sys.setdefaultencoding("UTF-8")  # set the encode to utf8
# check if you pass the input file and output file
if sys.argv[1] is not None and sys.argv[2] is not None:
    fileInput = sys.argv[1]
    fileOutput = sys.argv[2]
    inputFile = open("BankStatementJSON1.json", "r")  # open json file
    outputFile = open("testing2.csv", "w")  # load csv file
    data = json.load("BankStatementJSON1.json")  # load json content
    inputFile.close()  # close the input file
    output = csv.writer("testing.csv")  # create a csv.writer
    output.writerow(data[0].keys())  # header row
    for row in data:
        output.writerow(row.values())  # values row
This works for the JSON example you posted. The issue is that you have nested dicts, so you can't create sub-headers and sub-rows for pcredit, pdebit, pbalance, ptransactiondate, pvaluedate and ptest the way you want.
You can use csv.DictWriter:
import csv
import json

with open("BankStatementJSON1.json", "r") as inputFile:  # open json file
    data = json.loads(inputFile.read())  # load json content

with open("testing.csv", "w") as outputFile:  # open csv file
    output = csv.DictWriter(outputFile, data.keys())  # create a writer
    output.writeheader()
    output.writerow(data)
Make sure you're closing the output file at the end as well.
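If you do want one CSV row per entry in "Details", with the top-level fields repeated on every row, one rough way to flatten it could look like the sketch below (based only on the sample you posted; the variable names are my own):
import csv
import json

with open("BankStatementJSON1.json", "r") as inputFile:
    data = json.load(inputFile)

# keep the flat top-level fields, drop the nested lists
top = {k: v for k, v in data.items() if k not in ("Details", "Signatories")}

rows = []
for detail in data.get("Details", []):
    row = dict(top)
    row.update(detail)  # PTransactionDate, PValueDate, PCredit, PDebit, PBalance, ...
    rows.append(row)

# collect every key that appears, since some entries have PNarration and others PTest
fieldnames = sorted({key for r in rows for key in r})
with open("testing.csv", "w") as outputFile:
    writer = csv.DictWriter(outputFile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)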

Read CSV from URL one line at a time in Python 3.x

I have to read an online CSV file into a Postgres database, and in that context I have some problems reading the online CSV file properly.
If I just import the file it reads as bytes, so I have to decode it. During the decoding, however, it seems that the entire file is turned into one long string.
# Libraries
import csv
import urllib.request

# Function for importing csv from url
def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(url_open.read().decode('utf-8'), delimiter=',')
    return csvfile
# Reading file
p_pladser = csv_import("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
When I try to read the imported file line by line, it only reads one character at a time.
for row in p_pladser:
    print(row)
    break
['F']
Can you help me identify where it goes wrong? I am using Python 3.6.
EDIT: Per request my solution in R
# Loading library
library(RPostgreSQL)
# Reading dataframe
p_pladser = read.csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326", encoding = "UTF-8", stringsAsFactors = FALSE)
# Creating database connection
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "secretdatabase", host = "secrethost", user = "secretuser", password = "secretpassword")
# Uploading dataframe to postgres database
dbWriteTable(con, "p_pladser", p_pladser , append = TRUE, row.names = FALSE, encoding = "UTF-8")
I have to upload several tables of 10,000 to 100,000 rows each, and in total it takes R 1-2 seconds to upload them all.
csv.reader expects as argument a file-like object and not a string. You have 2 options here:
either you read the data into a string (as you currently do) and then use an io.StringIO to build a file-like object around that string:
import io

def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',')
    return csvfile
or you use an io.TextIOWrapper around the binary stream provided by urllib.request:
def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(io.TextIOWrapper(url_open, encoding='utf-8'), delimiter=',')
    return csvfile
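With either variant, iterating over the returned reader now yields whole rows (lists of fields) instead of single characters; a quick usage sketch:
p_pladser = csv_import("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
for row in p_pladser:
    print(row)  # e.g. the header row as a list of column names
    break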
How about loading the CSV with pandas!
import pandas as pd

csv = pd.read_csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
print(csv.columns)
OR if you have the CSV downloaded on your machine, then directly
csv = pd.read_csv("<path_to_csv>")
Ok! You may consider passing delimiter and quotechar arguments to csv.reader, because the CSV contains quotes as well! Something like this,
with open('p_pladser.csv') as f:
    rows = csv.reader(f, delimiter=',', quotechar='"')
    for row in rows:
        print(row)
