Elasticsearch fails in parsing datetime field coming from pymongo as object - python

I am trying to stream data from MongoDB to Elasticsearch using pymongo and the elasticsearch Python client.
I have defined a mapping; here is the snippet for the field of interest:
"updated_at": {
    "type": "date",
    "format": "dateOptionalTime"
}
My script grabs each document from MongoDB using pymongo and tries to index it into Elasticsearch:
import json
from bson import json_util
from elasticsearch import Elasticsearch
from pymongo import MongoClient
mongo_client = MongoClient('localhost', 27017)
es_client = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])
db = mongo_client['my_db']
collection = db['my_collection']
for doc in collection.find():
    es_client.index(
        index='index_name',
        doc_type='my_type',
        id=str(doc['_id']),
        # json_util.default emits MongoDB extended JSON for BSON types
        body=json.dumps(doc, default=json_util.default)
    )
The problem I have in running it is:
elasticsearch.exceptions.RequestError: TransportError(400, u'MapperParsingException[failed to parse [updated_at]]; nested: ElasticsearchIllegalArgumentException[unknown property [$date]]; ')
I believe the problem is that pymongo returns the updated_at field as a datetime.datetime object, as I can see when I print the doc inside the for loop:
u'updated_at': datetime.datetime(2014, 8, 31, 17, 18, 13, 17000)
json_util.default then serializes it as a {"$date": ...} object, which conflicts with the date type that the Elasticsearch mapping expects.
Any ideas how to solve this?

You're on the right path: your Python datetime needs to be serialized as an ISO 8601-compliant date string. To do that, pass a custom encoder to your json.dumps() call. First, declare CustomEncoder as a subclass of JSONEncoder that handles datetime and time values and delegates everything else:
import json
from datetime import datetime, time
from bson import json_util
class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.strftime('%Y-%m-%dT%H:%M:%S%z')
        if isinstance(obj, time):
            return obj.strftime('%H:%M:%S')
        if hasattr(obj, 'to_json'):
            return obj.to_json()
        # fall back to json_util for the remaining BSON types (e.g. ObjectId)
        return json_util.default(obj)
Then use it in your json.dumps call. Pass only cls here: a default= argument would replace CustomEncoder.default, and the datetime would again be emitted as {"$date": ...}:
...
body=json.dumps(doc, cls=CustomEncoder)
...
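A detail worth checking when combining `cls=` with `default=`: `json.dumps` forwards `default` to the encoder's constructor, and a non-None value replaces the encoder's own `default` method. The sketch below uses only the standard library; `IsoEncoder` and `fallback` are hypothetical names chosen for illustration, with `fallback` playing the role that `json_util.default` has in the question.

```python
import json
from datetime import datetime


class IsoEncoder(json.JSONEncoder):
    """Hypothetical encoder that renders datetimes as ISO 8601 strings."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)


def fallback(obj):
    # Hypothetical stand-in for json_util.default: wraps the value in an object.
    return {"$fallback": repr(obj)}


stamp = datetime(2014, 8, 31, 17, 18, 13, 17000)

# Only cls: IsoEncoder.default is used.
only_cls = json.dumps({"updated_at": stamp}, cls=IsoEncoder)

# cls and default together: the default= callable wins.
both = json.dumps({"updated_at": stamp}, cls=IsoEncoder, default=fallback)

print(only_cls)
print(both)
```

With both arguments the datetime never reaches `IsoEncoder.default`, so the output keeps the wrapped form rather than the ISO string.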

I guess your problem is that you're using
body=json.dumps(doc, default=json_util.default)
but you should be using
body=doc
Doing that works for me, since the elasticsearch client appears to take care of serializing the dictionary into a JSON document (assuming doc is a dictionary, which I guess it is).
At least in the version of elasticsearch I'm using (2.x), datetime.datetime is serialized correctly, with no need for a mapping. For example, this works for me:
from datetime import datetime, timezone
doc = {"updated_on": datetime.now(timezone.utc)}
res = es.index(index=es_index, doc_type='my_type',
               id=1, body=doc)
It is recognized by Kibana as a date.

You can use:
from elasticsearch_dsl.serializer import serializer
serializer.dumps(your_dict)
Replace your_dict with your Document().prepare() or document.to_dict()

Make sure you timestamp documents sent to Elasticsearch using datetime.now(timezone.utc):
from datetime import datetime, timezone
doc = {
    "timestamp": datetime.now(timezone.utc),
    # the rest of your data
}
This solved the problem of the time showing a strange drift in Elasticsearch.
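A plausible reason for the drift (an assumption, the answer does not say so) is that a naive datetime is serialized without a UTC offset, so the receiving side has to guess the zone. A small standard-library sketch of the difference:

```python
from datetime import datetime, timezone

naive = datetime.now()              # no tzinfo: local wall-clock time, zone unknown
aware = datetime.now(timezone.utc)  # carries an explicit UTC offset

naive_text = naive.isoformat()
aware_text = aware.isoformat()

print(naive_text)  # no offset suffix
print(aware_text)  # ends with +00:00
```

The aware value states its offset in the serialized string, so it cannot be misread as a different zone.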

Related

How to Convert DateTime to String in Django/DRF While Testing

I'm testing an endpoint which naturally returns a JSON containing the datetime as a string.
I compare the response content in test as such:
assert serializer_instance.data == {
    "created_at": str(model_instance.created_at),
    "updated_at": str(model_instance.updated_at),
}
created_at and updated_at are surely DateTimeFields. However, in this case, test fails saying:
E Differing items:
E {'created_at': '2020-06-24T12:42:03.578207+03:00'} != {'created_at': '2020-06-24 09:42:03.578207+00:00'}
E {'updated_at': '2020-06-24T12:42:03.578231+03:00'} != {'updated_at': '2020-06-24 09:42:03.578231+00:00'}
So str uses a different formatting on datetimes. Sure, the test case can be passed successfully using strftime, but there should be an internal function that does it easily in either Django or Django Rest Framework and I'd like to learn it.
Thanks in advance.
Environment
Python 3.8.3
Django 2.2.12
Django Rest Framework 3.11.0
I am a bit late to the party, but better late than never!
I use this method to assert datetime responses in DRF:
from rest_framework.fields import DateTimeField
# to_representation renders a datetime the same way DRF serializers do
drf_str_datetime = DateTimeField().to_representation
assert serializer_instance.data == {
    "created_at": drf_str_datetime(model_instance.created_at),
    "updated_at": drf_str_datetime(model_instance.updated_at),
}
For DRF I normally use
obj.ts_updated.astimezone(timezone(settings.TIME_ZONE)).isoformat()
This matches the DRF format.
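The mismatch from the question can be reproduced with the standard library alone, which shows why converting to the local zone and calling isoformat() lines up with the serializer output. In this sketch a fixed +03:00 offset stands in for settings.TIME_ZONE (an assumption made so the example runs without Django):

```python
from datetime import datetime, timedelta, timezone

# Value as stored (UTC), taken from the failing test output in the question.
created_at = datetime(2020, 6, 24, 9, 42, 3, 578207, tzinfo=timezone.utc)

# Assumption: a fixed +03:00 offset stands in for settings.TIME_ZONE.
local_tz = timezone(timedelta(hours=3))

as_str = str(created_at)
as_local_iso = created_at.astimezone(local_tz).isoformat()

print(as_str)        # 2020-06-24 09:42:03.578207+00:00
print(as_local_iso)  # 2020-06-24T12:42:03.578207+03:00
```

str() uses a space separator and keeps the UTC offset, while the serializer side uses the "T" separator and the configured zone.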
I've found a way. It uses Django's parse_datetime function: instead of converting the model instance's DateTimeField values with str, I thought it better to keep both sides as datetime objects.
from django.utils.dateparse import parse_datetime
data = serializer_instance.data
data["created_at"] = parse_datetime(data["created_at"])
# ... and the others ...
assert data == {
    # ... and the others ...
    "created_at": model_instance.created_at,
    # ... and the others ...
}
This works, although it mutates serializer_instance.data. I don't think that is going to be a problem in tests though.
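The reason comparing datetime objects is robust here: two timezone-aware datetimes compare equal when they denote the same instant, whatever offset each one carries. A minimal sketch, using the standard library's datetime.fromisoformat as a stand-in for parse_datetime (an assumption, to keep the example free of Django):

```python
from datetime import datetime, timezone

# String as it appears in the serializer output from the question.
rendered = "2020-06-24T12:42:03.578207+03:00"

# Value as the model instance holds it (UTC).
stored = datetime(2020, 6, 24, 9, 42, 3, 578207, tzinfo=timezone.utc)

parsed = datetime.fromisoformat(rendered)
same_instant = parsed == stored

print(same_instant)  # True
```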
you can use:
myDate.strftime('%m/%d/%Y')
or
'{:%m/%d/%Y}'.format(myDate)
For Django Rest Framework 3.11.0 you can use the following helper function to convert a Python datetime object into the string representation used by DRF:
from datetime import datetime
from django.conf import settings
from pytz import timezone as pytz_timezone
def convert_datetime_to_drf_str(date_time: datetime) -> str:
    # DRF renders a UTC offset of +00:00 as "Z"
    return date_time.astimezone(pytz_timezone(settings.TIME_ZONE)).isoformat().replace("+00:00", "Z")
So, for your specific case it would be:
assert serializer_instance.data == {
    "created_at": convert_datetime_to_drf_str(model_instance.created_at),
    "updated_at": convert_datetime_to_drf_str(model_instance.updated_at),
}
or directly without the helper function:
from django.conf import settings
from pytz import timezone as pytz_timezone
assert serializer_instance.data == {
    "created_at": model_instance.created_at.astimezone(pytz_timezone(settings.TIME_ZONE)).isoformat().replace("+00:00", "Z"),
    "updated_at": model_instance.updated_at.astimezone(pytz_timezone(settings.TIME_ZONE)).isoformat().replace("+00:00", "Z"),
}

How to compare sql vs json in python

I have the following problem.
I have a class User (simplified example):
class User:
    def __init__(self, name, lastname, status, id=None):
        self.id = id
        self.name = name
        self.lastname = lastname
        self.status = status
    def set_status(self, status):
        # call to the api to change status
        ...
    def get_data_from_db_by_id(self):
        # select data from db where id = self.id
        ...
    def __eq__(self, other):
        if not isinstance(other, User):
            return NotImplemented
        return (self.id, self.name, self.lastname, self.status) == \
               (other.id, other.name, other.lastname, other.status)
And I have a database structure like:
id, name, lastname, status
1, Alex, Brown, free
And json response from an API:
{
"id": 1,
"name": "Alex",
"lastname": "Brown",
"status": "Sleeping"
}
My questions are:
What is the best way to compare JSON and SQL responses?
What for? It's only for testing purposes: I have to check that the API has changed the DB correctly.
How can I deserialize the JSON and the DB result into the same class? Are there any common or best practices?
For now, I'm trying to use marshmallow for JSON and sqlalchemy for the DB, but have had no luck with it.
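Convert the database row to a dictionary:
def row2dict(row):
    d = {}
    for column in row.__table__.columns:
        # str() normalizes column types (int, datetime, ...) for comparison
        d[column.name] = str(getattr(row, column.name))
    return d
Then convert the JSON string to a dictionary, stringifying its values the same way so that, for example, the integer id matches:
d2 = {key: str(value) for key, value in json.loads(json_response).items()}
And finally compare:
d2 == row2dict(row)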
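For test output it can help to see which fields differ rather than a bare True or False. The sketch below compares two plain dictionaries using the sample data from the question; `diff_fields` is a hypothetical helper, and the database row is written out as a dict literal instead of being loaded through SQLAlchemy.

```python
import json


def diff_fields(expected, actual):
    """Return {field: (expected_value, actual_value)} for every mismatch."""
    keys = expected.keys() | actual.keys()
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if expected.get(key) != actual.get(key)
    }


# Sample data from the question: one database row and one API response.
db_row = {"id": 1, "name": "Alex", "lastname": "Brown", "status": "free"}
api_response = '{"id": 1, "name": "Alex", "lastname": "Brown", "status": "Sleeping"}'

mismatches = diff_fields(db_row, json.loads(api_response))
print(mismatches)  # {'status': ('free', 'Sleeping')}
```

An empty result means the API and the database agree on every field.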
If you are using SQLAlchemy for the database, then I would recommend using SQLAthanor (full disclosure: I am the library’s author).
SQLAthanor is a serialization and de-serialization library for SQLAlchemy that lets you configure robust rules for how to serialize / de-serialize your model instances to JSON. One way of checking your instance and JSON for equivalence is to execute the following logic in your Python code:
First, serialize your DB instance to JSON. Using SQLAthanor you can do that as simply as:
instance_as_json = my_instance.dump_to_json()
This will take your instance and dump all of its attributes to a JSON string. If you want more fine-grained control over which model attributes end up on your JSON, you can also use my_instance.to_json() which respects the configuration rules applied to your model.
Once you have your serialized JSON string, you can use the Validator-Collection to convert your JSON strings to dicts, and then check if your instance dict (from your instance JSON string) is equivalent to the JSON from the API (full disclosure: I’m also the author of the Validator-Collection library):
from validator_collection import checkers, validators
api_json_as_dict = validators.dict(api_json_as_string)
instance_json_as_dict = validators.dict(instance_as_json)
are_equivalent = checkers.are_dicts_equivalent(instance_json_as_dict, api_json_as_dict)
Depending on your specific situation and objectives, you can construct even more elaborate checks and validations as well, using SQLAthanor’s rich serialization and deserialization options.
Here are some links that you might find helpful:
SQLAthanor Documentation on ReadTheDocs
SQLAthanor on Github
.dump_to_json() documentation
.to_json() documentation
Validator-Collection Documentation
validators.dict() documentation
checkers.are_dicts_equivalent() documentation
Hope this helps!

Flask-SQLAlchemy ORM/GeoAlchemy2 results to a dictionary and ultimately JSON

I am using Flask/SQLAlchemy to create a web app with a map in it, so naturally I'm using a PostGIS database. The geom column requires an ST_Transform and somehow I need to turn this column and all others into JSON. The general structure of the database is:
from app import login, db
from datetime import datetime
from geoalchemy2 import Geometry
from time import time
from flask import current_app
from sqlalchemy import func
class Streets(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    street = db.Column(db.String(50))
    geom = db.Column(Geometry(geometry_type='LINESTRING'))
    def to_dict(self):
        data = {
            'id': self.id,
            'street': self.street,
            '_geom': func.ST_AsGeoJSON(func.ST_Transform(self.geom, 4326))
        }
        return data
My api route turns this result into an api:
return jsonify(Streets.query.get_or_404(id).to_dict())
But I keep getting this error: NameError: name 'ST_AsGeoJSON' is not defined
I also tried to create my _geom value like this:
data['_geom'] = db.session.query(func.ST_AsGeoJSON(func.ST_Transform(self.geom, 4326)))
The error message is: TypeError: Object of type 'BaseQuery' is not JSON serializable
Finally, I tried an api route like this:
data = Streets.to_dict(
    db.session.query(
        func.ST_AsGeoJSON(
            func.ST_Transform(
                Streets.geom, 4326
            )
        )
    )
    .filter(Streets.id==id))
return jsonify(data)
And I get a different error:
AttributeError: 'BaseQuery' object has no attribute 'id'
If I run this in flask shell it works:
streets = db.session.query(
    Streets.id,
    Streets.street,
    func.ST_AsGeoJSON(func.ST_Transform(Streets.geom, 4326)))
How can I perform ST_Transform and get JSON to my api route?
UPDATE
I found this in the SQLALchemy documentation that got me some progress: "orm.column_property() can be used to map a SQL expression". So I tried adding this to my class Streets(db.Model):
coords = db.column_property(func.ST_AsGeoJSON(func.ST_Transform(geom, 4326)))
Then I add it to data like this:
def to_dict(self):
    data = {
        'id': self.id,
        'street': self.street,
        'coords': self.coords
    }
    return data
But now I'm double encoding my results, once into GeoJSON and then I jsonify it:
return jsonify(Streets.query.get_or_404(id).to_dict())
So my api inserts \'s:
{"coords": "{\"type\":\"MultiLineString\",\"coordinates\":[[[-80.8357132798193,35.2260689001034],[-80.8347602582754,35.2252424284259]]]}"}
And using ST_AsText just turns it into text:
{"coords": "MULTILINESTRING((-80.8357132798193 35.2260689001034,-80.8347602582754 35.2252424284259))"}
I think I'm close with this update, but does anyone have a suggestion for getting correct GeoJSON with the JSON of the other fields of my database?
The first error
NameError: name 'ST_AsGeoJSON' is not defined
means that your example code is not what you were actually using: you forgot to access it through func. It would not work after fixing that either, since you'd be mixing the SQL world and the Python world. func.ST_AsGeoJSON(...) creates an SQL function expression object that is supposed to be compiled to SQL and sent to the DB in a query, not passed to jsonify().
The second error
TypeError: Object of type 'BaseQuery' is not JSON serializable
should be somewhat obvious.
data['_geom'] = db.session.query(func.ST_AsGeoJSON(func.ST_Transform(self.geom, 4326)))
creates a Query, and a too broad query at that, since you've not limited it to fetch data of the current object. The Query object is not JSON serializable.
In
data = Streets.to_dict(db.session.query(...)...)
you pass the Query object as self to Streets.to_dict(), which then tries to access its id attribute in
'id': self.id,
which fails for obvious reasons – namely passing an unrelated object as the instance to a method.
The column_property() approach produces doubly encoded JSON because ST_AsGeoJSON returns text, and SQLAlchemy by default treats the result as a plain string rather than as JSON. Try decoding it manually in between:
import json
def to_dict(self):
    data = {
        'id': self.id,
        'street': self.street,
        # decode the GeoJSON text so jsonify() encodes it only once
        'coords': json.loads(self.coords)
    }
    return data
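The double encoding can be reproduced without a database. In this sketch a hard-coded string stands in for the text that ST_AsGeoJSON returns (an assumption, with shortened coordinates), and json.dumps plays the role of jsonify:

```python
import json

# Stand-in for the text returned by ST_AsGeoJSON (shortened coordinates).
geojson_text = '{"type":"LineString","coordinates":[[-80.8357,35.2260],[-80.8347,35.2252]]}'

# Encoding the text as-is nests a JSON string inside the JSON document.
double_encoded = json.dumps({"coords": geojson_text})

# Decoding it first embeds a real object instead.
decoded_first = json.dumps({"coords": json.loads(geojson_text)})

print(double_encoded)
print(decoded_first)
```

A client parsing the first document gets a string for "coords" and has to parse it a second time; parsing the second document yields the geometry object directly.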

TypeError: ObjectId('') is not JSON serializable

My response back from MongoDB after running an aggregation on a document using Python is valid and I can print it, but I cannot return it.
Error:
TypeError: ObjectId('51948e86c25f4b1d1c0d303c') is not JSON serializable
Print:
{'result': [{'_id': ObjectId('51948e86c25f4b1d1c0d303c'), 'api_calls_with_key': 4, 'api_calls_per_day': 0.375, 'api_calls_total': 6, 'api_calls_without_key': 2}], 'ok': 1.0}
But when I try to return it:
TypeError: ObjectId('51948e86c25f4b1d1c0d303c') is not JSON serializable
It is a RESTful call:
@appv1.route('/v1/analytics')
def get_api_analytics():
    # get handle to collections in MongoDB
    statistics = sldb.statistics
    objectid = ObjectId("51948e86c25f4b1d1c0d303c")
    analytics = statistics.aggregate([
        {'$match': {'owner': objectid}},
        {'$project': {'owner': "$owner",
                      'api_calls_with_key': {'$cond': [{'$eq': ["$apikey", None]}, 0, 1]},
                      'api_calls_without_key': {'$cond': [{'$ne': ["$apikey", None]}, 0, 1]}
                      }},
        {'$group': {'_id': "$owner",
                    'api_calls_with_key': {'$sum': "$api_calls_with_key"},
                    'api_calls_without_key': {'$sum': "$api_calls_without_key"}
                    }},
        {'$project': {'api_calls_with_key': "$api_calls_with_key",
                      'api_calls_without_key': "$api_calls_without_key",
                      'api_calls_total': {'$add': ["$api_calls_with_key", "$api_calls_without_key"]},
                      'api_calls_per_day': {'$divide': [{'$add': ["$api_calls_with_key", "$api_calls_without_key"]}, {'$dayOfMonth': datetime.now()}]},
                      }}
    ])
    print(analytics)
    return analytics
The db is connected, the collection exists, and I get back the expected result, but when I try to return it I get the JSON error. Any idea how to convert the response into JSON? Thanks
Pymongo provides json_util - you can use that one instead to handle BSON types
import json
from bson import json_util
def parse_json(data):
    return json.loads(json_util.dumps(data))
You should define your own JSONEncoder and use it:
import json
from bson import ObjectId
class JSONEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, ObjectId):
            return str(o)
        return json.JSONEncoder.default(self, o)
JSONEncoder().encode(analytics)
It's also possible to use it in the following way.
json.dumps(analytics, cls=JSONEncoder)
>>> from bson import Binary, Code
>>> from bson.json_util import dumps
>>> dumps([{'foo': [1, 2]},
... {'bar': {'hello': 'world'}},
... {'code': Code("function x() { return 1; }")},
...        {'bin': Binary(b"\x01\x02\x03\x04")}])
'[{"foo": [1, 2]}, {"bar": {"hello": "world"}}, {"code": {"$code": "function x() { return 1; }", "$scope": {}}}, {"bin": {"$binary": "AQIDBA==", "$type": "00"}}]'
Actual example from json_util.
Unlike Flask's jsonify, "dumps" will return a string, so it cannot be used as a 1:1 replacement of Flask's jsonify.
But this question shows that we can serialize using json_util.dumps(), convert back to dict using json.loads() and finally call Flask's jsonify on it.
Example (derived from previous question's answer):
from bson import json_util, ObjectId
import json
#Lets create some dummy document to prove it will work
page = {'foo': ObjectId(), 'bar': [ObjectId(), ObjectId()]}
#Dump loaded BSON to valid JSON string and reload it as dict
page_sanitized = json.loads(json_util.dumps(page))
return page_sanitized
This solution converts ObjectId and other BSON types (i.e. Binary, Code, etc.) to their extended JSON equivalents, such as a "$oid" object.
JSON output would look like this:
{
"_id": {
"$oid": "abc123"
}
}
Most users who receive the "not JSON serializable" error simply need to specify default=str when using json.dumps. For example:
json.dumps(my_obj, default=str)
This will force a conversion to str, preventing the error. Of course then look at the generated output to confirm that it is what you need.
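To see what "look at the generated output" means in practice, here is a small standard-library sketch. The record is made up for illustration; it contains three types that json cannot serialize on its own.

```python
import json
import uuid
from datetime import datetime
from decimal import Decimal

# Made-up record with three types json cannot encode natively.
record = {
    "id": uuid.UUID(int=1),
    "created_at": datetime(2018, 8, 8, 22, 6, 17),
    "price": Decimal("9.90"),
}

encoded = json.dumps(record, default=str)
print(encoded)
# {"id": "00000000-0000-0000-0000-000000000001", "created_at": "2018-08-08 22:06:17", "price": "9.90"}
```

Note that the datetime comes out with a space separator instead of the ISO 8601 "T", and the decimal becomes a string rather than a number, which may or may not suit the consumer.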
from bson import json_util
import json
@app.route('/')
def index():
    # db is assumed to be a pymongo database handle; this returns the first document found
    for doc in db["collection_name"].find():
        return json.dumps(doc, indent=4, default=json_util.default)
This is a sample of converting BSON into a JSON string. You can try this.
As a quick replacement, you can change {'owner': objectid} to {'owner': str(objectid)}.
But defining your own JSONEncoder is a better solution, it depends on your requirements.
Posting here as I think it may be useful for people using Flask with pymongo. This is my current "best practice" setup for allowing Flask to marshal pymongo BSON data types.
mongoflask.py
from datetime import datetime, date
import isodate as iso
from bson import ObjectId
from flask.json import JSONEncoder
from werkzeug.routing import BaseConverter
class MongoJSONEncoder(JSONEncoder):
    def default(self, o):
        if isinstance(o, (datetime, date)):
            return iso.datetime_isoformat(o)
        if isinstance(o, ObjectId):
            return str(o)
        else:
            return super().default(o)
class ObjectIdConverter(BaseConverter):
    def to_python(self, value):
        return ObjectId(value)
    def to_url(self, value):
        return str(value)
app.py
from flask import Flask, jsonify
from .mongoflask import MongoJSONEncoder, ObjectIdConverter
def create_app():
    app = Flask(__name__)
    app.json_encoder = MongoJSONEncoder
    app.url_map.converters['objectid'] = ObjectIdConverter
    # Client sends their string, we interpret it as an ObjectId
    @app.route('/users/<objectid:user_id>')
    def show_user(user_id):
        # setup not shown, pretend this gets us a pymongo db object
        db = get_db()
        # user_id is a bson.ObjectId ready to use with pymongo!
        result = db.users.find_one({'_id': user_id})
        # And jsonify returns normal looking json!
        # {"_id": "5b6b6959828619572d48a9da",
        #  "name": "Will",
        #  "birthday": "1990-03-17T00:00:00Z"}
        return jsonify(result)
    return app
Why do this instead of serving BSON or mongod extended JSON?
I think serving mongo special JSON puts a burden on client applications. Most client apps will not care about using mongo objects in any complex way. If I serve extended JSON, I have to handle it on both the server side and the client side. ObjectId and Timestamp are easier to work with as strings, and this keeps all this mongo marshalling madness quarantined to the server.
{
    "_id": "5b6b6959828619572d48a9da",
    "created_at": "2018-08-08T22:06:17Z"
}
I think this is less onerous to work with for most applications than the following:
{
    "_id": {"$oid": "5b6b6959828619572d48a9da"},
    "created_at": {"$date": 1533837843000}
}
For those who need to return the data through jsonify with Flask:
cursor = db.collection.find()
data = []
for doc in cursor:
    doc['_id'] = str(doc['_id'])  # This does the trick!
    data.append(doc)
return jsonify(data)
You could try:
objectid = str(ObjectId("51948e86c25f4b1d1c0d303c"))
in my case I needed something like this:
class JsonEncoder():
    def encode(self, o):
        if '_id' in o:
            o['_id'] = str(o['_id'])
        return o
This is how I recently fixed the error:
@app.route('/')
def home():
    docs = []
    for doc in db.person.find():
        # drop the ObjectId, which jsonify cannot serialize
        doc.pop('_id')
        docs.append(doc)
    return jsonify(docs)
I know I'm posting late, but I thought it would help at least a few folks!
Both of the examples mentioned by tim and defuz (the top-voted answers) work perfectly fine. However, there is a small difference between them that can be significant at times.
The json_util approach ("Pymongo provides json_util - you can use that one instead to handle BSON types") wraps the id in an extra "$oid" field, which is redundant and may not be ideal in all cases:
Output: {
    "_id": {
        "$oid": "abc123"
    }
}
Whereas the JSONEncoder class gives the id as a plain string, which is what we need, although we have to call json.loads(output) in addition:
Output: {
    "_id": "abc123"
}
Even though the first method looks simpler, both methods need very little effort.
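If the json_util output is already in hand and the plain-string form is wanted, the wrappers can be removed after the fact. The helper below is a hypothetical sketch (`unwrap_oids` is not part of pymongo or bson); it works on the dictionaries obtained from json.loads and only handles the "$oid" wrapper.

```python
def unwrap_oids(value):
    """Recursively replace {"$oid": "..."} wrappers with the bare id string."""
    if isinstance(value, dict):
        if set(value) == {"$oid"}:
            return value["$oid"]
        return {key: unwrap_oids(item) for key, item in value.items()}
    if isinstance(value, list):
        return [unwrap_oids(item) for item in value]
    return value


# Shape produced by json.loads(json_util.dumps(...)), with made-up ids.
extended = {"_id": {"$oid": "abc123"}, "friends": [{"$oid": "def456"}], "name": "Will"}

plain = unwrap_oids(extended)
print(plain)  # {'_id': 'abc123', 'friends': ['def456'], 'name': 'Will'}
```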
I would like to provide an additional solution that improves the accepted answer. I have previously provided the answers in another thread here.
from flask import Flask
from flask.json import JSONEncoder
from bson import json_util
from . import resources
# define a custom encoder that points to the json_util provided by pymongo (or its dependency bson)
class CustomJSONEncoder(JSONEncoder):
    def default(self, obj):
        return json_util.default(obj)
application = Flask(__name__)
application.json_encoder = CustomJSONEncoder
if __name__ == "__main__":
    application.run()
If you do not need the _id of the records, I recommend excluding it when querying the DB, which lets you print the returned records directly.
To exclude the _id when querying and then print the data in a loop, write something like this:
records = mycollection.find(query, {'_id': 0})  # second argument {'_id': 0} excludes the id from the results
for record in records:
    print(record)
If you want to send it as a JSON response, you need to format it in two steps.
First, use json_util.dumps() from bson to convert the ObjectId in the BSON response to a JSON-compatible format, i.e. "_id": {"$oid": "123456789"}. The JSON string obtained from json_util.dumps() will have backslashes and quotes.
Then, to remove the backslashes and quotes, use json.loads() from json.
from bson import ObjectId, json_util
import json
# the ids below are shortened placeholders; a real ObjectId takes a 24-character hex string
bson_data = [{'_id': ObjectId('123456789'), 'field': 'somedata'},{'_id': ObjectId('123456781'), 'field': 'someMoredata'}]
json_data_with_backslashes = json_util.dumps(bson_data)
# output will look like this
# "[{\"_id\": {\"$oid\": \"123456789\"}, \"field\": \"somedata\"},{\"_id\": {\"$oid\": \"123456781\"}, \"field\": \"someMoredata\"}]"
json_data = json.loads(json_data_with_backslashes)
# output will look like this
# [{"_id": {"$oid": "123456789"},"field": "somedata"},{"_id": {"$oid": "123456781"},"field": "someMoredata"}]
Flask's jsonify provides security enhancements as described in JSON Security. If a custom encoder is used with Flask, it's better to consider the points discussed in JSON Security.
If you don't want _id in response, you can refactor your code something like this:
jsonResponse = getResponse(mock_data)
del jsonResponse['_id'] # removes '_id' from the final response
return jsonResponse
This will remove the TypeError: ObjectId('') is not JSON serializable error.
from bson.objectid import ObjectId
from core.services.db_connection import DbConnectionService
class DbExecutionService:
    def __init__(self):
        self.db = DbConnectionService()
    def list(self, collection, search):
        session = self.db.create_connection(collection)
        # stringify ObjectId values so every row is JSON serializable
        return list(map(lambda row: {i: str(row[i]) if isinstance(row[i], ObjectId) else row[i] for i in row}, session.find(search)))
SOLUTION for: mongoengine + marshmallow
If you use mongoengine and marshmallow then this solution might be applicable for you.
Basically, I imported the String field from marshmallow and overrode the default Schema id so that it is encoded as a String.
from marshmallow import Schema
from marshmallow.fields import String
class FrontendUserSchema(Schema):
    id = String()
    class Meta:
        fields = ("id", "email")

Return MongoEngine Documents as JSON

Not too sure if this is really simple or not, but I can't really find anything on the topic. But, either using the regular MongoEngine library, or even Flask-MongoEngine for my Flask based website, would it be possible to return a MongoEngine document as straight JSON?
Thanks!
In 0.8 there are helpers; see https://github.com/MongoEngine/mongoengine/issues/1
In the meantime you have to use pymongo's json_util directly:
from bson import json_util
json_util.dumps(MyDoc._collection_obj.find(MyDoc.objects()._query))
Ross's and Jellyflower's workarounds don't work when field projection or ordering is used.
More general workaround:
from bson import json_util
json = json_util.dumps(query._cursor)
The correct workaround should probably be:
from bson import json_util
objects = MyDoc.objects()
json_util.dumps(objects._collection_obj.find(objects._query))
Update: thanks to Lo-Tan for to_mongo() method usage suggestion.
Eventually I came up with the following solution:
import json
from json import JSONEncoder
from mongoengine.base import BaseDocument
class MongoEncoder(JSONEncoder):
    def default(self, o):
        if isinstance(o, BaseDocument):
            data = o.to_mongo()
            # might not be present if EmbeddedDocument
            o_id = data.pop('_id', None)
            if o_id:
                data['id'] = str(o_id['$oid'])
            data.pop('_cls', None)
            return data
        else:
            return JSONEncoder.default(self, o)
# consider `obj` to be MongoEngine object
json_data = json.dumps(obj, cls=MongoEncoder)
It uses to_json() method, added as the response to the aforementioned issue.
