I am using Elasticsearch in Python. I have data in a pandas DataFrame (3 columns); I added two columns, _index and _type, and converted the data to JSON, one object per record, using pandas' built-in method:
data = data.to_json(orient='records')
The data then looks like this:
[{"op_key":99140046678,"employee_key":991400459,"Revenue Results":6625.76480192,"_index":"revenueindex","_type":"revenuetype"},
{"op_key":99140045489,"employee_key":9914004258,"Revenue Results":6691.05435536,"_index":"revenueindex","_type":"revenuetype"},
......
}]
My mapping is:
user_mapping = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
    },
    "mappings": {
        "revenuetype": {
            "properties": {
                "op_key": {"type": "string"},
                "employee_key": {"type": "string"},
                "Revenue Results": {"type": "float", "index": "not_analyzed"}
            }
        }
    }
}
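A mapping like this is applied at index-creation time; as a minimal sketch with the elasticsearch-py client (assuming a client instance named es):

es.indices.create(index='revenueindex', body=user_mapping)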
Then I get this error when calling helpers.bulk(es, data):
Traceback (most recent call last):
File "/Users/adaggula/Documents/workspace/ElSearchPython/sample.py", line 59, in <module>
res = helpers.bulk(client,data)
File "/Users/adaggula/workspace/python/pve/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 188, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/Users/adaggula/workspace/python/pve/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 160, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/Users/adaggula/workspace/python/pve/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 89, in _process_bulk_chunk
raise e
elasticsearch.exceptions.RequestError: TransportError(400, u'action_request_validation_exception', u'Validation Failed: 1: index is
missing;2: type is missing;3: index is missing;4: type is missing;5: index is
missing;6: ....... type is missing;999: index is missing;1000: type is missing;')
It looks like index and type are reported missing for every JSON object. How can I fix this?
The DataFrame-to-JSON conversion was the trick that resolved the problem: to_json() returns a single JSON string, so helpers.bulk() iterates over its characters instead of over action dicts. Parsing the string back into a list of dicts fixes it:
import json

data = data.to_json(orient='records')
data = json.loads(data)
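For context, here is a minimal end-to-end sketch of the whole flow. The cluster address and sample values are assumptions; this targets the Elasticsearch 2.x-era client, where the bulk helper honors per-document _index/_type keys:

import json

import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['http://localhost:9200'])  # assumed cluster address

df = pd.DataFrame({
    'op_key': [99140046678, 99140045489],
    'employee_key': [991400459, 9914004258],
    'Revenue Results': [6625.76480192, 6691.05435536],
})
df['_index'] = 'revenueindex'
df['_type'] = 'revenuetype'

# to_json() yields one string; json.loads() turns it back into a list of dicts
actions = json.loads(df.to_json(orient='records'))
helpers.bulk(es, actions)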
I'm trying to pull some data from a flight-simulation JSON feed. It's updated every 15 seconds, and I've been trying to run print(obj['pilots']['flight_plans']['cid']). However, I'm getting this error:
Traceback (most recent call last):
File "main.py", line 18, in <module>
print(obj['pilots']['flight_plans']['cid'])
TypeError: list indices must be integers or slices, not str
My code is below:
import json
from urllib.request import urlopen

# initial setup
URL = "https://data.vatsim.net/v3/vatsim-data.json"

# fetch and decode the JSON feed
response = urlopen(URL)
obj = json.loads(response.read().decode('utf-8'))

# result is connections
# print(obj["general"]["connected_clients"])
print(obj['pilots']['flight_plans']['cid'])
The print(obj["general"]["connected_clients"]) does work.
Investigate your obj with print(json.dumps(obj, indent=2)). You'll find that the pilots key is a list of dictionaries, each containing flight_plan (singular, not plural) and cid keys. Here are the first few lines:
{
"general": {
"version": 3,
"reload": 1,
"update": "20220301062202",
"update_timestamp": "2022-03-01T06:22:02.245318Z",
"connected_clients": 292,
"unique_users": 282
},
"pilots": [
{
"cid": 1149936,
"name": "1149936",
"callsign": "URO504",
"server": "UK",
"pilot_rating": 0,
"latitude": -23.39706,
"longitude": -46.3709,
"altitude": 9061,
"groundspeed": 327,
"transponder": "0507",
"heading": 305,
"qnh_i_hg": 29.97,
"qnh_mb": 1015,
"flight_plan": {
"flight_rules": "I",
"aircraft": "A346",
...
For example, iterate over the list of pilots and print name/cid:
for pilot in obj['pilots']:
print(pilot['name'],pilot['cid'])
Output:
1149936 1149936
Nick Aydin OTHH 1534423
Oguz Aydin 1429318
Marvin Steglich LSZR 1482019
Daniel Krol EPKK 1279199
... etc ...
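To reach the flight-plan fields the question was after, index into each pilot's flight_plan dict and guard against pilots who haven't filed one. A sketch using only keys visible in the excerpt above:

for pilot in obj['pilots']:
    fp = pilot.get('flight_plan')  # may be missing if no plan is filed
    if fp:
        print(pilot['cid'], fp['flight_rules'], fp['aircraft'])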
I'm getting the following error when trying to do a batch_update post to a Google Sheet. There are 5600 rows of data that I am trying to post to the sheet:
('/home/xx/xxx/xx.csv', <Spreadsheet u'spreadsheet name' id:id#>, 'A5600')
Traceback (most recent call last):
File "xx.py", line 50, in <module>
pasteCsv(csvFile, sheet, cell)
File "xx.py", line 38, in pasteCsv
return sheet.batch_update(body)
File "/home/xx/.local/lib/python2.7/site-packages/gspread/models.py", line 146, in batch_update
'post', SPREADSHEET_BATCH_UPDATE_URL % self.id, json=body
File "/home/xx/.local/lib/python2.7/site-packages/gspread/client.py", line 73, in request
raise APIError(response)
gspread.exceptions.APIError: {u'status': u'INVALID_ARGUMENT', u'message': u'Invalid requests[0].pasteData: GridCoordinate.rowIndex[5599] is after last row in grid[999]', u'code': 400}
Is there a way to grow the grid from [999] rows to a higher number so that I can post the CSV file contents?
Answer:
You can make a batch request to increase the number of rows in the sheet before you insert the CSV content.
Example using a Batch Request:
spreadsheetId = "your-spreadsheet-id"
sheetId = "sheet-id"
sh = client.open_by_key(spreadsheetId)
request = {
    "requests": [
        {
            "insertDimension": {
                "range": {
                    "sheetId": sheetId,
                    "dimension": "ROWS",
                    "startIndex": 999,
                    "endIndex": 5599
                },
                "inheritFromBefore": False
            }
        }
    ]
}
result = sh.batch_update(request)
You will need to change sheetId to be the gid of the sheet within the Spreadsheet you are updating.
Remember: rows and columns are 0-indexed, so inserting rows below row 1000 will mean having a startIndex of 999.
Example using gspread methods:
Alternatively in gspread you can directly use the gspread.models.Worksheet.add_rows() method:
sh = client.open_by_key(spreadsheetId)
ws = sh.get_worksheet(index)
ws.add_rows(4600)
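If you'd rather derive the number of rows from the CSV itself, something like this should work. A sketch assuming the csvFile path and the sh spreadsheet handle from the question; row_count is gspread's current number of rows in the worksheet:

import csv

with open(csvFile) as f:
    n_rows = sum(1 for _ in csv.reader(f))

ws = sh.get_worksheet(0)  # assumed: the target worksheet is the first one
if n_rows > ws.row_count:
    ws.add_rows(n_rows - ws.row_count)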
References:
Row and Column Operations | Sheets API | Google Developers
Insert an empty row or column
API Reference - gspread 3.6.0 documentation - add_rows(rows)
I have been able to successfully authenticate and list transfers and transfer runs, but I keep running into the issue of not being able to create a transfer because the transfer config is incorrect.
Here's the transfer config I have tried:
transferConfig = {
    'data_refresh_window_days': 1,
    'data_source_id': "adwords",
    'destination_dataset_id': "AdwordsMCC",
    'disabled': False,
    'display_name': "TestR",
    'name': "TestR",
    'schedule': "every day 07:00",
    'params': {
        "customer_id": "999999999"  # number changed
    }
}
response = client.create_transfer_config(parent, transferConfig)
print(response)
And this is the error I get:
Traceback (most recent call last):
File "./create_transfer.py", line 84, in <module>
main()
File "./create_transfer.py", line 61, in main
response = client.create_transfer_config(parent, transferConfig)
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/google/cloud/bigquery_datatransfer_v1/gapic/data_transfer_service_client.py", line 438, in create_transfer_config
authorization_code=authorization_code)
ValueError: Protocol message Struct has no "customer_id" field.
I managed to set up a data transfer through the API by defining params as a google.protobuf.struct_pb2.Struct.
Try whether adding the following works for you:
from google.protobuf.struct_pb2 import Struct
params = Struct()
params["customer_id"] = "999999999"
And then changing your transferConfig to:
transferConfig = {
    'data_refresh_window_days': 1,
    'data_source_id': "adwords",
    'destination_dataset_id': "AdwordsMCC",
    'disabled': False,
    'display_name': "TestR",
    'name': "TestR",
    'schedule': "every day 07:00",
    'params': params
}
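Putting it together, a minimal sketch; the project id and client construction are assumptions, and create_transfer_config is called with the older positional GAPIC signature seen in the traceback:

from google.cloud import bigquery_datatransfer_v1
from google.protobuf.struct_pb2 import Struct

client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = client.project_path("your-project-id")  # assumed project id

params = Struct()
params["customer_id"] = "999999999"

transferConfig = {
    'data_refresh_window_days': 1,
    'data_source_id': "adwords",
    'destination_dataset_id': "AdwordsMCC",
    'display_name': "TestR",
    'schedule': "every day 07:00",
    'params': params
}

response = client.create_transfer_config(parent, transferConfig)
print(response)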
The data structure is like:
way: {
    _id: '9762264',
    node: ['253333910', '3304026514']
}
I'm trying to count how often each node appears across ways. Here is my code using PyMongo:
node = db.way.aggregate(
    [
        {'$unwind': '$node'},
        {'$group': {
            '_id': '$node',
            'appear_count': {'$sum': 1}
        }},
        {'$sort': {'appear_count': -1}},
        {'$limit': 10}
    ],
    {'allowDiskUse': True}
)
It reports this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File ".../OSM Wrangling/explore.py", line 78, in most_passed_node
{'allowDiskUse': True}
File ".../pymongo/collection.py", line 2181, in aggregate
**kwargs)
File ".../pymongo/collection.py", line 2088, in _aggregate
client=self.__database.client)
File ".../pymongo/pool.py", line 464, in command
self.validate_session(client, session)
File ".../pymongo/pool.py", line 609, in validate_session
if session._client is not client:
AttributeError: 'dict' object has no attribute '_client'
However, if I remove the {'allowDiskUse': True} and test on a smaller set of data, it works well. It seems the allowDiskUse argument is what breaks, and there is no information about this error in the MongoDB docs.
How should I solve this problem and get the answer I want?
This is because the method signature of collection.aggregate() changed in PyMongo 3.6: an optional session parameter was added, so your second positional argument (the options dict) is being treated as a session, which is why you see 'dict' object has no attribute '_client'.
The method signature now is :
aggregate(pipeline, session=None, **kwargs)
Applying this to your code example, you can specify allowDiskUse as below:
node = db.way.aggregate(pipeline=[
{'$unwind': '$node'},
{'$group': {
'_id': '$node',
'appear_count': {'$sum': 1}
}
},
{'$sort': {'appear_count': -1}},
{'$limit': 10}
],
allowDiskUse=True
)
See also pymongo.client_session if you would like to know more about session.
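If you do want an explicit session (the thing PyMongo tried to interpret your options dict as), the call looks roughly like this. A sketch assuming a local mongod and an osm database name:

from pymongo import MongoClient

client = MongoClient()  # assumed local mongod
db = client['osm']      # assumed database name

pipeline = [
    {'$unwind': '$node'},
    {'$group': {'_id': '$node', 'appear_count': {'$sum': 1}}},
    {'$sort': {'appear_count': -1}},
    {'$limit': 10},
]

# allowDiskUse still goes through as a keyword argument
with client.start_session() as session:
    for doc in db.way.aggregate(pipeline, session=session, allowDiskUse=True):
        print(doc)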
Note that the mongo shell is JavaScript and case sensitive, so there the boolean is lowercase: {allowDiskUse: true}. In Python it must be True.
The code that's causing this is:
new_order = shopify.Order.create(json.dumps({'order': {"email": "foo@example.com", "fulfillment_status": "fulfilled", "line_items": [{'message': "words go here"}]}}))
I tried without the json.dumps and got an error saying it was an unhashable type. I also tried this, based on some research:
data = dict()
data['order'] = {"email": "foo@example.com", "fulfillment_status": "fulfilled", "line_items": [{'message': "words go here"}]}
print(data['order'])
new_order = shopify.Order.create(json.dumps(data))
What can I do to properly send in a simple order like the one at https://help.shopify.com/api/reference/order#create?
C:\Python27\python.exe C:/Users/Kris/Desktop/moon_story/story_app.py
Traceback (most recent call last):
File "C:/Users/Kris/Desktop/moon_story/story_app.py", line 41, in <module>
{'fulfillment_status': 'fulfilled', 'email': 'foo@example.com', 'line_items': [{'message': 'words go here'}]}
get_story(1520)
File "C:/Users/Kris/Desktop/moon_story/story_app.py", line 29, in get_story
new_order = shopify.Order.create(json.dumps(data))
File "C:\Python27\lib\site-packages\pyactiveresource\activeresource.py", line 448, in create
resource = cls(attributes)
File "C:\Python27\lib\site-packages\shopify\base.py", line 126, in __init__
prefix_options, attributes = self.__class__._split_options(attributes)
File "C:\Python27\lib\site-packages\pyactiveresource\activeresource.py", line 465, in _split_options
for key, value in six.iteritems(options):
File "C:\Python27\lib\site-packages\six.py", line 599, in iteritems
return d.iteritems(**kw)
AttributeError: 'str' object has no attribute 'iteritems'
After some digging, I was able to get this working. You shouldn't need to do anything special with the argument passed to create. The following works for me:
shop_url = "https://%s:%s#%s.myshopify.com/admin" % (shopify_key, shopify_pass, shopify_store_name)
shopify.ShopifyResource.set_site(shop_url)
order_data = {
"email": "test#test.com",
"fulfillment_status": "fulfilled",
"line_items": [
{
"title": "ITEM TITLE",
"variant_id": 7214792579,
"quantity": 1,
"price": 895
}
]
}
shopify.Order.create(order_data)
It's worth noting that this Python library relies on another Shopify-created library called pyactiveresource. That library provides the underlying create method, which calls the save method.
The save method has the following notes about responses:
Args:
None
Returns:
True on success, False on ResourceInvalid errors (sets the errors
attribute if an <errors> object is returned by the server).
Raises:
connection.Error: On any communications problems.
I was continually getting a False response. This helped me understand which fields were actually required by looking at the errors attribute, so I figured it might be helpful here.
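For example, something along these lines surfaces the validation messages; full_messages() comes from pyactiveresource's errors object:

order = shopify.Order.create(order_data)
if order.errors:
    # lists which fields the API rejected and why
    print(order.errors.full_messages())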
Comment: ... get an order(None) as response. ... Any thoughts?
Comparing with help.shopify.com/api/reference there are the following differences:
The endpoint has to be /admin/orders.json. Why do you use /admin?
The main key in the JSON dict has to be order. Why don't you use this, for example:
{
"order": {
"email": "foo#example.com",
"fulfillment_status": "fulfilled",
"line_items": [
{
"variant_id": 447654529,
"quantity": 1
}
]
}
}
Use:
new_order = shopify.Order.create(data['order'])
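Combining the two answers into a complete, hedged example; the variant_id is the sample value from the JSON above, and errors/full_messages() come from pyactiveresource:

import shopify

data = {
    "order": {
        "email": "foo@example.com",
        "fulfillment_status": "fulfilled",
        "line_items": [{"variant_id": 447654529, "quantity": 1}]
    }
}

new_order = shopify.Order.create(data["order"])  # pass the inner dict, not a JSON string
if new_order.errors:
    print(new_order.errors.full_messages())
else:
    print(new_order.id)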