MongoDB doesn't handle aggregation with allowDiskUse: True - python

The data structure is like:
way: {
    _id: '9762264',
    node: ['253333910', '3304026514']
}
I'm trying to count how often each node appears across ways. Here is my code using pymongo:
node = db.way.aggregate([
        {'$unwind': '$node'},
        {'$group': {
            '_id': '$node',
            'appear_count': {'$sum': 1}
        }},
        {'$sort': {'appear_count': -1}},
        {'$limit': 10}
    ],
    {'allowDiskUse': True}
)
It reports this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File ".../OSM Wrangling/explore.py", line 78, in most_passed_node
{'allowDiskUse': True}
File ".../pymongo/collection.py", line 2181, in aggregate
**kwargs)
File ".../pymongo/collection.py", line 2088, in _aggregate
client=self.__database.client)
File ".../pymongo/pool.py", line 464, in command
self.validate_session(client, session)
File ".../pymongo/pool.py", line 609, in validate_session
if session._client is not client:
AttributeError: 'dict' object has no attribute '_client'
However, if I remove the {'allowDiskUse': True} and test on a smaller set of data, it works well. It seems the allowDiskUse option is causing the problem, and there is no information about this error in the MongoDB docs.
How should I solve this problem and get the answer I want?

This is because in PyMongo 3.6 the method signature of collection.aggregate() changed: an optional session parameter was added.
The method signature is now:
aggregate(pipeline, session=None, **kwargs)
Applying this to your code example, you can specify allowDiskUse as below:
node = db.way.aggregate(pipeline=[
        {'$unwind': '$node'},
        {'$group': {
            '_id': '$node',
            'appear_count': {'$sum': 1}
        }},
        {'$sort': {'appear_count': -1}},
        {'$limit': 10}
    ],
    allowDiskUse=True
)
See also pymongo.client_session if you would like to know more about sessions.
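If you want to pass a session explicitly as well, here is a minimal sketch (assuming client is your MongoClient, pipeline is the list above, and the server is MongoDB 3.6+ so sessions are supported):
with client.start_session() as session:
    # session is now a dedicated parameter; other options such as
    # allowDiskUse are still passed as keyword arguments
    cursor = db.way.aggregate(pipeline, session=session, allowDiskUse=True)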

Note that JavaScript is case-sensitive, so in the mongo shell you would use the lowercase boolean:
{allowDiskUse: true}
In Python, the capitalized True is correct.


Why do MongoEngine/pymongo give an error only the first time an object is accessed?

I have defined MongoEngine classes that are mapped to MongoDB. When I access the data using MongoEngine, a specific piece of code fails on the first attempt but successfully returns data on the second attempt. Executing the code in a Python terminal:
from Project.Mongo import User
user = User.objects(username='xyz@xyz.com').first()
from Project.Mongo import Asset
Asset.objects(org = user.org)
The last line generates the following error on the first attempt:
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.5/dist-packages/mongoengine/queryset/manager.py", line 37, in get
queryset = queryset_class(owner, owner._get_collection())
File "/usr/local/lib/python3.5/dist-packages/mongoengine/document.py", line 209, in _get_collection
cls.ensure_indexes()
File "/usr/local/lib/python3.5/dist-packages/mongoengine/document.py", line 765, in ensure_indexes
collection.create_index(fields, background=background, **opts)
File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 1754, in create_index
self.__create_index(keys, kwargs, session, **cmd_options)
File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 1656, in __create_index
session=session)
File "/usr/local/lib/python3.5/dist-packages/pymongo/collection.py", line 245, in _command
retryable_write=retryable_write)
File "/usr/local/lib/python3.5/dist-packages/pymongo/pool.py", line 517, in command
collation=collation)
File "/usr/local/lib/python3.5/dist-packages/pymongo/network.py", line 125, in command
parse_write_concern_error=parse_write_concern_error)
File "/usr/local/lib/python3.5/dist-packages/pymongo/helpers.py", line 145, in _check_command_response
raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Index: { v: 2, key: { org: 1, _fts: "text", _ftsx: 1 }, name: "org_1_name_content_text_description_text_content_text_tag_content_text_remote.source_text", ns: "digitile.asset", weights: { content: 3, description: 1, name_content: 10, remote.owner__name: 20, remote.source: 2, tag_content: 2 }, default_language: "english", background: false, language_override: "language", textIndexVersion: 3 } already exists with different options: { v: 2, key: { org: 1, _fts: "text", _ftsx: 1 }, name: "org_1_name_text_description_text_content_text_tag_content_text_remote.source_text", ns: "digitile.asset", default_language: "english", background: false, weights: { content: 3, description: 1, name: 10, remote.owner__name: 20, remote.source: 2, tag_content: 2 }, language_override: "language", textIndexVersion: 3 }
When I try the same last line a second time, it produces the correct result.
I am using Python 3.5.2, pymongo 3.7.2, and mongoengine 0.10.6.
The first time you call .objects on a document class, mongoengine tries to create the indexes if they don't exist.
In this case it fails during the creation of an index on the asset collection (details of the indexes are taken from your Asset/User Document classes), as you can see in the error message:
pymongo.errors.OperationFailure: Index: {...new index details...} already exists with different options {...existing index details...}.
The second time you make that call, mongoengine assumes the indexes were already created and doesn't attempt to create them again, which explains why the second call passes.
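The root cause is visible in the diff between the two definitions: the new index weights name_content while the existing one weights name. As a hypothetical fix sketch, you could drop the stale index so mongoengine can recreate it on the next call (the index name and the digitile.asset namespace are taken from the error message):
from pymongo import MongoClient

client = MongoClient()
# drop the existing conflicting text index; mongoengine will recreate it
client.digitile.asset.drop_index(
    "org_1_name_text_description_text_content_text_tag_content_text_remote.source_text")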

How do you create an AdWords BigQuery Transfer and Transfer Runs using the bigquery_datatransfer Python client?

I have been able to authenticate and list transfers and transfer runs successfully, but I keep running into the issue of not being able to create a transfer because the transfer config is incorrect.
Here's the Transfer Config I have tried:
transferConfig = {
    'data_refresh_window_days': 1,
    'data_source_id': "adwords",
    'destination_dataset_id': "AdwordsMCC",
    'disabled': False,
    'display_name': "TestR",
    'name': "TestR",
    'schedule': "every day 07:00",
    'params': {
        "customer_id": "999999999"  # changed number
    }
}
response = client.create_transfer_config(parent, transferConfig)
print(response)
And this is the error I get:
Traceback (most recent call last):
File "./create_transfer.py", line 84, in <module>
main()
File "./create_transfer.py", line 61, in main
response = client.create_transfer_config(parent, transferConfig)
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/cloud/bigquery_datatransfer_v1/gapic/data_transfer_service_client.py", line 438, in create_transfer_config
authorization_code=authorization_code)
ValueError: Protocol message Struct has no "customer_id" field.
I managed to set up a data transfer through the API by defining params as a google.protobuf.struct_pb2.Struct.
Try whether adding the following works for you:
from google.protobuf.struct_pb2 import Struct
params = Struct()
params["customer_id"] = "999999999"
And then changing your transferConfig to:
transferConfig = {
    'data_refresh_window_days': 1,
    'data_source_id': "adwords",
    'destination_dataset_id': "AdwordsMCC",
    'disabled': False,
    'display_name': "TestR",
    'name': "TestR",
    'schedule': "every day 07:00",
    'params': params
}
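For completeness, here is a minimal sketch of the full call under the same assumptions (client is a bigquery_datatransfer_v1.DataTransferServiceClient, PROJECT_ID is a placeholder for your GCP project, and project_path is the resource-path helper on this generation of the client):
from google.cloud import bigquery_datatransfer_v1

client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = client.project_path(PROJECT_ID)  # PROJECT_ID is a placeholder

# transferConfig is the dict above, with params built as a protobuf Struct
response = client.create_transfer_config(parent, transferConfig)
print(response)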

Creating new orders with Python API, get AttributeError: 'str' object has no attribute 'iteritems'

The code I have that's causing this is:
new_order = shopify.Order.create(json.dumps({'order': {"email": "foo@example.com", "fulfillment_status": "fulfilled", "line_items": [{'message': "words go here"}]}}))
I tried without the json.dumps and got a response that it was an unhashable type. I also tried this from some research:
data = dict()
data['order'] = {"email": "foo@example.com", "fulfillment_status": "fulfilled", "line_items": [{'message': "words go here"}]}
print(data['order'])
new_order = shopify.Order.create(json.dumps(data))
What can I do to properly send a simple order like in https://help.shopify.com/api/reference/order#create?
C:\Python27\python.exe C:/Users/Kris/Desktop/moon_story/story_app.py
{'fulfillment_status': 'fulfilled', 'email': 'foo@example.com', 'line_items': [{'message': 'words go here'}]}
Traceback (most recent call last):
File "C:/Users/Kris/Desktop/moon_story/story_app.py", line 41, in <module>
get_story(1520)
File "C:/Users/Kris/Desktop/moon_story/story_app.py", line 29, in get_story
new_order = shopify.Order.create(json.dumps(data))
File "C:\Python27\lib\site-packages\pyactiveresource\activeresource.py", line 448, in create
resource = cls(attributes)
File "C:\Python27\lib\site-packages\shopify\base.py", line 126, in __init__
prefix_options, attributes = self.__class__._split_options(attributes)
File "C:\Python27\lib\site-packages\pyactiveresource\activeresource.py", line 465, in _split_options
for key, value in six.iteritems(options):
File "C:\Python27\lib\site-packages\six.py", line 599, in iteritems
return d.iteritems(**kw)
AttributeError: 'str' object has no attribute 'iteritems'
After some digging, I was able to get this working. You shouldn't need to do anything special with the argument passed to create. The following works for me:
shop_url = "https://%s:%s@%s.myshopify.com/admin" % (shopify_key, shopify_pass, shopify_store_name)
shopify.ShopifyResource.set_site(shop_url)
order_data = {
    "email": "test@test.com",
    "fulfillment_status": "fulfilled",
    "line_items": [
        {
            "title": "ITEM TITLE",
            "variant_id": 7214792579,
            "quantity": 1,
            "price": 895
        }
    ]
}
shopify.Order.create(order_data)
It's worth noting that this Python library relies on another Shopify-created library called pyactiveresource. That library provides the underlying create method, which calls the save method.
The save method has the following notes about responses:
Args:
None
Returns:
True on success, False on ResourceInvalid errors (sets the errors
attribute if an <errors> object is returned by the server).
Raises:
connection.Error: On any communications problems.
I was continually getting a False response. Looking at the errors attribute helped me understand which fields were actually required, so I figured it might be helpful here.
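For example, a hypothetical sketch of surfacing those validation errors (is_valid() and errors.full_messages() are pyactiveresource APIs):
order = shopify.Order.create(order_data)
if not order.is_valid():
    # full_messages() lists what the server rejected, e.g. missing fields
    print(order.errors.full_messages())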
Comment: ... get an order(None) as response. ... Any thoughts?
Comparing with help.shopify.com/api/reference, there are the following differences:
The endpoint has to be /admin/orders.json. Why do you use /admin?
The main key in the JSON dict has to be order. Why don't you use this, for example:
{
    "order": {
        "email": "foo@example.com",
        "fulfillment_status": "fulfilled",
        "line_items": [
            {
                "variant_id": 447654529,
                "quantity": 1
            }
        ]
    }
}
Use:
new_order = shopify.Order.create(data['order'])

Error while loading bulk data into Elasticsearch

I am using Elasticsearch in Python. I have data in a pandas DataFrame (3 columns); I added two columns, _index and _type, and converted the data to JSON records using pandas' built-in method:
data = data.to_json(orient='records')
This is my data then:
[{"op_key":99140046678,"employee_key":991400459,"Revenue Results":6625.76480192,"_index":"revenueindex","_type":"revenuetype"},
{"op_key":99140045489,"employee_key":9914004258,"Revenue Results":6691.05435536,"_index":"revenueindex","_type":"revenuetype"},
......
}]
My mapping is:
user_mapping = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
    },
    'mappings': {
        'revenuetype': {
            'properties': {
                'op_key': {'type': 'string'},
                'employee_key': {'type': 'string'},
                'Revenue Results': {'type': 'float', 'index': 'not_analyzed'},
            }
        }
    }
}
Then I face this error while using helpers.bulk(es, data):
Traceback (most recent call last):
File "/Users/adaggula/Documents/workspace/ElSearchPython/sample.py", line 59, in <module>
res = helpers.bulk(client,data)
File "/Users/adaggula/workspace/python/pve/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 188, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/Users/adaggula/workspace/python/pve/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 160, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/Users/adaggula/workspace/python/pve/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 89, in _process_bulk_chunk
raise e
elasticsearch.exceptions.RequestError: TransportError(400, u'action_request_validation_exception', u'Validation Failed: 1: index is
missing;2: type is missing;3: index is missing;4: type is missing;5: index is
missing;6: ....... type is missing;999: index is missing;1000: type is missing;')
It looks like the index and type are missing for every JSON object. How can I overcome this?
The pandas DataFrame-to-JSON conversion was the trick that resolved the problem: to_json() returns a string, but helpers.bulk() expects an iterable of action dicts, so the string has to be parsed back into Python objects:
data = data.to_json(orient='records')
data = json.loads(data)
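Putting it together, a minimal sketch under the same assumptions (data is the original DataFrame carrying _index and _type columns, with a node on localhost):
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
# to_json() returns a string; json.loads() yields the list of action dicts
actions = json.loads(data.to_json(orient='records'))
success, errors = helpers.bulk(es, actions)
print(success, errors)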

Elastic Search: pyes.exceptions.IndexMissingException exception from search result

This is a question about the Elasticsearch Python API (pyes).
I run a very simple testcase through curl, and everything seems to work as expected.
Here is a description of the curl test case:
The only document that exists in ES is:
curl 'http://localhost:9200/test/index1' -d '{"page_text":"This is the text that was found on the page!"}'
Then I search ES for all documents in which the word "found" exists. The result seems to be OK:
curl 'http://localhost:9200/test/index1/_search?q=page_text:found&pretty=true'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.15342641,
"hits" : [ {
"_index" : "test",
"_type" : "index1",
"_id" : "uaxRHpQZSpuicawk69Ouwg",
"_score" : 0.15342641, "_source" : {"page_text":"This is the text that was found on the page!"}
} ]
}
}
However, when I run the same query through the Python 2.7 API (pyes), something goes wrong:
>>> import pyes
>>> conn = pyes.ES('localhost:9200')
>>> result = conn.search({"page_text":"found"}, index="index1")
>>> print result
<pyes.es.ResultSet object at 0xd43e50>
>>> result.count()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/pythonbrew/pythons/Python-2.7.3/lib/python2.7/site-packages/pyes/es.py", line 1717, in count
return self.total
File "/usr/local/pythonbrew/pythons/Python-2.7.3/lib/python2.7/site-packages/pyes/es.py", line 1686, in total
self._do_search()
File "/usr/local/pythonbrew/pythons/Python-2.7.3/lib/python2.7/site-packages/pyes/es.py", line 1646, in _do_search
doc_types=self.doc_types, **self.query_params)
File "/usr/local/pythonbrew/pythons/Python-2.7.3/lib/python2.7/site-packages/pyes/es.py", line 1381, in search_raw
return self._query_call("_search", body, indices, doc_types, **query_params)
File "/usr/local/pythonbrew/pythons/Python-2.7.3/lib/python2.7/site-packages/pyes/es.py", line 622, in _query_call
return self._send_request('GET', path, body, params=querystring_args)
File "/usr/local/pythonbrew/pythons/Python-2.7.3/lib/python2.7/site-packages/pyes/es.py", line 603, in _send_request
raise_if_error(response.status, decoded)
File "/usr/local/pythonbrew/pythons/Python-2.7.3/lib/python2.7/site-packages/pyes/convert_errors.py", line 83, in raise_if_error
raise excClass(msg, status, result, request)
pyes.exceptions.IndexMissingException: [_all] missing
As you can see, pyes returns the result object, but for some reason I can't even get the number of results from it.
Does anyone have a guess as to what may be wrong here?
Thanks a lot in advance!
The name of the parameter changed: it's no longer called index, it's called indices, and it takes a list:
>>> result = conn.search({"page_text":"found"}, indices=["index1"])
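With the corrected parameter the lazy ResultSet can execute its query, so, given the single matching document from the curl test above, one would expect:
>>> result.count()
1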
