Python: error handling OrderedDict with unicode data

My script migrates data from MySQL to MongoDB. It runs fine as long as no unicode columns are included, but it throws the error below when the OrgLanguages column is added.
mongoImp = dbo.insert_many(odbcArray)
File "/home/lrsa/.local/lib/python2.7/site-packages/pymongo/collection.py", line 711, in insert_many
blk.execute(self.write_concern.document)
File "/home/lrsa/.local/lib/python2.7/site-packages/pymongo/bulk.py", line 493, in execute
return self.execute_command(sock_info, generator, write_concern)
File "/home/lrsa/.local/lib/python2.7/site-packages/pymongo/bulk.py", line 319, in execute_command
run.ops, True, self.collection.codec_options, bwc)
bson.errors.InvalidStringData: strings in documents must be valid UTF-8: 'Portugu\xeas do Brasil, ?????, English, Deutsch, Espa\xf1ol latinoamericano, Polish'
My code:
import MySQLdb, MySQLdb.cursors, sys, pymongo, collections
odbcArray=[]
mongoConStr = '192.168.10.107:36006'
sqlConnect = MySQLdb.connect(host = "54.175.170.187", user = "testuser", passwd = "testuser", db = "testdb", cursorclass=MySQLdb.cursors.DictCursor)
mongoConnect = pymongo.MongoClient(mongoConStr)
sqlCur = sqlConnect.cursor()
sqlCur.execute("SELECT ID,OrgID,OrgLanguages,APILoginID,TransactionKey,SMTPSpeed,TimeZoneName,IsVideoWatched FROM organizations")
dbo = mongoConnect.eaedw.mysqlData
tuples = sqlCur.fetchall()
for tuple in tuples:
    odbcArray.append(collections.OrderedDict(tuple))
mongoImp = dbo.insert_many(odbcArray)
sqlCur.close()
mongoConnect.close()
sqlConnect.close()
sys.exit()
The script above migrates the data correctly when the OrgLanguages column is left out of the SELECT query.
To work around this, I tried building the OrderedDict() another way, but that gives me a different error.
Changed code:
for tuple in tuples:
    doc = collections.OrderedDict()
    doc['oid'] = tuple.OrgID
    doc['APILoginID'] = tuple.APILoginID
    doc['lang'] = unicode(tuple.OrgLanguages)
    odbcArray.append(doc)
mongoImp = dbo.insert_many(odbcArray)
Error Received:
Traceback (most recent call last):
File "pymsql.py", line 19, in <module>
doc['oid'] = tuple.OrgID
AttributeError: 'dict' object has no attribute 'OrgID'

Your MySQL connection is returning characters in a different encoding than UTF-8, which is the encoding that all BSON strings must be in. Try your original code but pass charset='utf8' to MySQLdb.connect.
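To see why BSON rejects the string, here is a minimal sketch. It assumes the connection returned Latin-1 (ISO-8859-1) bytes, which is a guess based on the `\xea` and `\xf1` bytes in the error message; the actual server charset is not stated in the question. It is written in Python 3 syntax for illustration, and the helper name `is_valid_utf8` is made up for this example.

```python
# Bytes as they appear in the error message (assumed here to be Latin-1).
raw = b'Portugu\xeas do Brasil, Espa\xf1ol latinoamericano'


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# Decode with the charset the bytes were actually written in,
# then re-encode as UTF-8, which is what BSON requires.
text = raw.decode('latin-1')
utf8_bytes = text.encode('utf-8')

print(is_valid_utf8(raw))         # False
print(is_valid_utf8(utf8_bytes))  # True
print(text)
```

Passing charset='utf8' to the connection avoids this manual step. The AttributeError in the second attempt is a separate problem: DictCursor returns each row as a plain dict, so a column is read as tuple['OrgID'] rather than tuple.OrgID.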

Related

Getting IndexError: string index out of range while loading file in bigquery

I am trying to load a CSV file from my Google Cloud Storage bucket into BigQuery.
There is no extra processing, just a simple load operation, but it is failing.
Below is the code snippet:
def bigquery_commit(element):
    from google.cloud import bigquery
    PROJECT = 'emerald-990'
    source = contact_options.output.get()
    client = bigquery.Client(project=PROJECT)
    dataset_ref = client.dataset('snow')
    table_ref = dataset_ref.table(source)
    table = client.get_table(table_ref)
    errors = client.insert_rows(table, element)
    print("Errors occurred:", errors)
Error
IndexError: string index out of range [while running 'FlatMap(bigquery_commit)']
main function :
options = PipelineOptions()
p = beam.Pipeline(options=options)
(p
 | 'Read from a File' >> beam.io.ReadFromText(contact_options.input, skip_header_lines=0)
 | beam.FlatMap(bigquery_commit))
p.run().wait_until_finish()
Now, when I pass a test record directly, it works.
Example:
rows_to_insert = [{u'id': 101, u'name': 'tom', u'salary': 99899}]
errors = client.insert_rows(table, rows_to_insert)
Any idea what I am missing?
An issue that I notice quickly here is that your data coming from ReadFromText has type str, while it seems that client.insert_rows takes a LIST of elements.
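The type mismatch described above can be reproduced without BigQuery. The sketch below uses a hypothetical stand-in, `field_values`, that reads a fixed number of positional values from each row, roughly the way a client would walk a schema. It is not the library's code, only an illustration of what happens when a bare string is passed where a list of rows is expected.

```python
def field_values(rows, n_fields):
    """Mimic a client that reads n_fields positional values from each row."""
    out = []
    for row in rows:  # iterating a str yields single characters
        out.append([row[i] for i in range(n_fields)])
    return out


line = "101,tom,99899"  # one element as produced by ReadFromText

try:
    field_values(line, 3)  # a bare string is treated as many 1-char "rows"
except IndexError as exc:
    print("IndexError:", exc)

# Wrapping a parsed record in a list gives the expected shape.
record = line.split(",")
print(field_values([record], 3))
```

Each character of the string becomes its own "row", and asking that one-character row for its second field raises the IndexError.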
You should consider rewriting your code to use native Apache Beam transforms, like so:
(p
 | 'Read from a File' >> beam.io.ReadFromText(contact_options.input, skip_header_lines=0)
 | beam.Map(lambda x: json.loads(x))  # Parse your JSON strings
 | beam.io.WriteToBigQuery(table=table_ref))
Now, I do not recommend the following approach, but if you really need to fix your code, you'd do:
def bigquery_commit(element):
    import json
    from google.cloud import bigquery
    PROJECT = 'emerald-990'
    source = contact_options.output.get()
    client = bigquery.Client(project=PROJECT)
    dataset_ref = client.dataset('snow')
    table_ref = dataset_ref.table(source)
    table = client.get_table(table_ref)
    parsed_element = json.loads(element)  # one str element becomes one dict
    errors = client.insert_rows(table, [parsed_element])  # insert_rows expects a list of rows
    print("Errors occurred:", errors)
I have experienced the same error message, but under a slightly different context. I could not find the answers for my issue so I thought of leaving a hint here.
The traceback did not help much:
File "/user_code/main.py", line 198, in execute
errors = client.insert_rows(table, row)
File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 2113, in insert_rows
json_rows = [_record_field_to_json(schema, row) for row in rows]
File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 2113, in <listcomp>
json_rows = [_record_field_to_json(schema, row) for row in rows]
File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/_helpers.py", line 415, in _record_field_to_json
record[subname] = _field_to_json(subfield, subvalue)
File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/_helpers.py", line 444, in _field_to_json
return _repeated_field_to_json(field, row_value)
File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/_helpers.py", line 386, in _repeated_field_to_json
values.append(_field_to_json(item_field, item))
File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/_helpers.py", line 447, in _field_to_json
return _record_field_to_json(field.fields, row_value)
File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/_helpers.py", line 414, in _record_field_to_json
subvalue = row_value[subindex]
IndexError: string index out of range
I was running a script that inserted a row into BigQuery with the correct column order, values, and data types. The script selected the appropriate dataset and table, and if the table did not exist, it would create it. That turned out to be the problem.
When the table was created, I had provided a certain number of columns in its schema. Shortly afterwards I added two more columns to the schema in the script instead of altering the existing table. So the column order was correct and the inserted rows matched the data types, but the table had not been created with all the columns of the inserted row:
IndexError: string index out of range
The solution was deleting and recreating the table with the desired schema. It worked like a charm!
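A mismatch like this can be caught before inserting by comparing the row's keys with the column names the table actually has. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, and the schema is a hard-coded list of names. With the real client, the names would presumably come from something like `[field.name for field in table.schema]`.

```python
def missing_columns(schema_names, row):
    """Return the row keys that the table schema does not define."""
    return sorted(set(row) - set(schema_names))


schema_names = ["id", "name"]  # hypothetical: columns the table was created with
row = {"id": 101, "name": "tom", "salary": 99899}

print(missing_columns(schema_names, row))  # ['salary']
```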

retrieving list key with unicode characters in

I have a huge list like the one below, and I'm trying to access each value. However, I have a hard time retrieving \xe6ndret, since its key contains a special character. How can I access the value when there are unicode characters in the key?
vejstykker = [{u'navngivenvej_id': u'4fb0b0a2-8be7-4254-90d5-fc1af6eb111c', u'kode': u'0007', u'oprettet': u'2019-04-03T13:30:37.031', u'kommunekode': u'0101', u'navn': u'Dompapvej', u'adresseringsnavn': u'Dompapvej', u'\xe6ndret': None, u'id': u'097ae470-9532-4d67-8cb9-6420c601fc24'}]
I have tried something like this:
for vejstykke in vejstykker:
    created_at = datetime.datetime.strptime(vejstykke['oprettet'], '%Y-%m-%dT%H:%M:%S.%f')
    id = vejstykke['id']
    kommunekode = vejstykke['kommunekode']
    kode = vejstykke['kode']
    navn = vejstykke['navn']
    adresseringsnavn = vejstykke['adresseringsnavn']
    navngivenvej_id = vejstykke['navngivenvej_id']
    changed_at = vejstykke['ændret']
However, I get the error below:
Traceback (most recent call last):
  File "", line 28, in run
KeyError: '\xc3\xa6ndret'
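The KeyError shows the lookup key as the bytes `\xc3\xa6ndret`, which is the UTF-8 encoding of "ændret", while the dict key is the unicode string `u'\xe6ndret'`. A byte string and a unicode string with non-ASCII content do not compare equal, so the lookup misses. The sketch below models the two types in Python 3 syntax with `bytes` and `str`; the variable names are made up for the example.

```python
record = {'\xe6ndret': None, 'navn': 'Dompapvej'}  # key is text

byte_key = '\xe6ndret'.encode('utf-8')  # b'\xc3\xa6ndret', as in the KeyError
print(byte_key in record)               # False: bytes never match a text key

text_key = byte_key.decode('utf-8')     # back to text
print(text_key in record)               # True
print(record[text_key])                 # None
```

In Python 2, the equivalent is to write the key as a unicode literal, either u'\xe6ndret' or u'ændret' together with a `# -*- coding: utf-8 -*-` declaration at the top of the source file.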

Why peewee coerces select column to integer

I can't use the SQLite function group_concat() in peewee. Here is a complete snippet. For some reason peewee wants to convert the result of group_concat() to an integer, although it is a string ("1,2"). I can't find a way to suppress this.
from peewee import *
db = SqliteDatabase(':memory:')
class Test(Model):
    name = CharField()
    score = IntegerField()

    class Meta:
        database = db
db.create_tables([Test])
Test.create(name='A', score=1).save()
Test.create(name='A', score=2).save()
#select name, group_concat(score) from Test group by name
for t in Test.select(Test.name, fn.group_concat(Test.score)).order_by(Test.name):
    pass
It produces the following error:
Traceback (most recent call last):
File "C:\Users\u_tem0m\Dropbox\Wrk\sgo\broken.py", line 17, in <module>
for t in Test.select(Test.name, fn.group_concat(Test.score)).order_by(Test.name):
File "C:\Program Files\Python 3.5\lib\site-packages\peewee.py", line 1938, in next
obj = self.qrw.iterate()
File "C:\Program Files\Python 3.5\lib\site-packages\peewee.py", line 1995, in iterate
return self.process_row(row)
File "C:\Program Files\Python 3.5\lib\site-packages\peewee.py", line 2070, in process_row
setattr(instance, column, func(row[i]))
File "C:\Program Files\Python 3.5\lib\site-packages\peewee.py", line 874, in python_value
return value if value is None else self.coerce(value)
ValueError: invalid literal for int() with base 10: '1,2'
Try adding a coerce(False) to your call to group_concat:
query = (Test
         .select(Test.name, fn.GROUP_CONCAT(Test.score).coerce(False))
         .order_by(Test.name))
for t in query:
    pass
Peewee sees that Test.score is an integer field, so whenever a function is called on that column, Peewee will try to convert the result back to an int. The problem is that group_concat returns a string, so we must tell Peewee not to mess with the return value.
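The underlying behaviour can be checked with the standard-library `sqlite3` module, without peewee. This is a minimal sketch using a throwaway in-memory table that mirrors the model above. It shows that SQLite returns the aggregate as text, which is why an integer conversion fails.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE test (name TEXT, score INTEGER)')
conn.executemany('INSERT INTO test VALUES (?, ?)', [('A', 1), ('A', 2)])

name, scores = conn.execute(
    'SELECT name, group_concat(score) FROM test GROUP BY name'
).fetchone()

print(type(scores).__name__)  # str: the aggregate is text, not an integer
try:
    int(scores)               # what an integer coercion would attempt
except ValueError as exc:
    print('ValueError:', exc)

conn.close()
```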
I just found that the result of fn.group_concat("" + Test.score) is not cast to an integer, but I think the resulting SQL may be less optimal:
SELECT "t1"."name", group_concat(? + "t1"."score") AS allscore FROM "test" AS t1 ORDER BY "t1"."name" ['']
Does anybody know a more elegant way?

Adding Item data to a DynamoDB table using boto does not work

I have been trying to add items to a DynamoDB table using boto, but somehow it doesn't seem to work. I tried using users.Item() and users.put_item but nothing worked. Below is the script that I have in use.
import boto.dynamodb2
import boto.dynamodb2.items
import json
from boto.dynamodb2.fields import HashKey, RangeKey, GlobalAllIndex
from boto.dynamodb2.layer1 import DynamoDBConnection
from boto.dynamodb2.table import Table
from boto.dynamodb2.items import Item
from boto.dynamodb2.types import NUMBER
region = "us-east-1"
con = boto.dynamodb2.connect_to_region(region)
gettables = con.list_tables()
mytable = "my_table"
if mytable not in gettables['TableNames']:
    print "The table *%s* is not in the list of tables created. A new table will be created." % mytable
    Table.create(mytable,
                 schema = [HashKey('username'),
                           RangeKey('ID', data_type = NUMBER)],
                 throughput = {'read': 1, 'write': 1})
else:
    print "The table *%s* exists." % mytable
con2table = Table(mytable, connection=con)
con2table.put_item(data={'username': 'abcd',
                         'ID': '001',
                         'logins': '10',
                         'timeouts': '20',
                         'daysabsent': '30'
                         })
When I run this, the table is created without problems. But when I try to put in the items, I get the following error message.
Traceback (most recent call last):
File "/home/ec2-user/DynamoDB_script.py", line 29, in <module>
'daysabsent':'30'
File "/usr/lib/python2.7/dist-packages/boto/dynamodb2/table.py", line 821, in put_item
return item.save(overwrite=overwrite)
File "/usr/lib/python2.7/dist-packages/boto/dynamodb2/items.py", line 455, in save
returned = self.table._put_item(final_data, expects=expects)
File "/usr/lib/python2.7/dist-packages/boto/dynamodb2/table.py", line 835, in _put_item
self.connection.put_item(self.table_name, item_data, **kwargs)
File "/usr/lib/python2.7/dist-packages/boto/dynamodb2/layer1.py", line 1510, in put_item
body=json.dumps(params))
File "/usr/lib/python2.7/dist-packages/boto/dynamodb2/layer1.py", line 2842, in make_request
retry_handler=self._retry_handler)
File "/usr/lib/python2.7/dist-packages/boto/connection.py", line 954, in _mexe
status = retry_handler(response, i, next_sleep)
File "/usr/lib/python2.7/dist-packages/boto/dynamodb2/layer1.py", line 2882, in _retry_handler
response.status, response.reason, data)
boto.dynamodb2.exceptions.ValidationException: ValidationException: 400 Bad Request
{u'message': u'One or more parameter values were invalid: Type mismatch for key version expected: N actual: S', u'__type': u'com.amazon.coral.validate#ValidationException'}
Thank you.
From the error message you are getting, it sounds like you are trying to send string values for an attribute that is defined as numeric in DynamoDB.
The specific issue looks to be related to your Range Key ID which is defined as a numeric value N but you are sending it a string value '001'.
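One way to avoid the mismatch is to convert the numeric key attributes before calling put_item. The sketch below is plain Python and does not call boto; `coerce_numeric` is a hypothetical helper written for this example.

```python
def coerce_numeric(item, numeric_attrs):
    """Return a copy of item with the named attributes converted to int."""
    fixed = dict(item)
    for attr in numeric_attrs:
        fixed[attr] = int(fixed[attr])
    return fixed


data = {'username': 'abcd', 'ID': '001', 'logins': '10'}
print(coerce_numeric(data, ['ID']))
# {'username': 'abcd', 'ID': 1, 'logins': '10'}
```

Note that the leading zeros of '001' are lost once the value is stored as a number.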
It looks like one of the values you are trying to load is empty.
I got the same error when loading the item below; the exception was raised when the partner_name property was an empty string.
try:
    item_old = self.table.get_item(hash_key=term)
except BotoClientError as ex:
    # if partner alias does not exist then create a new entry!
    if ex.message == "Key does not exist.":
        item_old = self.table.new_item(term)
    else:
        raise ex
item_old['partner_code'] = partner_code
item_old['partner_name'] = partner_name
item_old.put()

Cassandra: 'unicode' does not have the buffer interface

I am trying to use these two prepared statements in my Django app:
READINGS = "SELECT * FROM readings"
READINGS_BY_USER_ID = "SELECT * FROM readings WHERE user_id=?"
I query against the db with:
def get_all(self):
    query = self.session.prepare(ps.ALL_READINGS)
    all_readings = self.session.execute(query)
    return all_readings

def get_all_by_user_id(self, user_id):
    query = self.session.prepare(ps.READINGS_BY_USER_ID)
    readings = self.session.execute(query, [user_id])
    return readings
The first one works well, but the second gives me:
ERROR 2015-07-08 09:42:56,634 | views::exception_handler 47 | ('Unable to complete the operation against any hosts', {<Host: localhost data1>: TypeError("'unicode' does not have the buffer interface",)})
Can anyone tell me what happened here? I understand, that there must be a unicode string somewhere that does not have a buffer interface. But which string is meant? My prepared statement?
Here is the stacktrace in addition:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/rest_framework/views.py", line 448, in dispatch
response = handler(request, *args, **kwargs)
File "/Users/me/Workspace/project/Readings/views.py", line 36, in get_by_user_id
readings = self.tr_dao.get_all_by_user_id(user_id)
File "/Users/me/Workspace/project/Readings/dao.py", line 22, in get_all_by_user_id
readings = self.session.execute(query, [user_id], timeout=60)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/cassandra/cluster.py", line 1405, in execute
result = future.result(timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/cassandra/cluster.py", line 2967, in result
raise self._final_exception
If you are on Python 2, this will probably fix it:
def get_all_by_user_id(self, user_id):
    query = self.session.prepare(ps.READINGS_BY_USER_ID)
    readings = self.session.execute(query, [str(user_id)])
    return readings
This is not working because your user_id is of type unicode. You can check it using
type(user_id)
If that is the case, you should convert it to a byte string:
str(user_id)
It will solve the issue.
