Indexing dataframe into Elasticsearch using Python

I'm trying to index a pandas DataFrame into Elasticsearch. I'm having some trouble with parsing of the JSON that I'm generating. I think my problem comes from the mapping. Please find my code below.
import logging
from pprint import pprint
from elasticsearch import Elasticsearch
import pandas as pd

def create_index(es_object, index_name):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "mappings": {
            "danger": {
                "dynamic": "strict",
                "properties": {
                    "name": {
                        "type": "text"
                    },
                    "first_name": {
                        "type": "text"
                    },
                    "age": {
                        "type": "integer"
                    },
                    "city": {
                        "type": "text"
                    },
                    "sex": {
                        "type": "text",
                    },
                }
            }
        }
    }
    try:
        if not es_object.indices.exists(index_name):
            # Ignore 400 means to ignore the "Index Already Exists" error
            es_object.indices.create(index=index_name, ignore=400,
                                     body=settings)
            print('Created Index')
            created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created

def store_record(elastic_object, index_name, record):
    is_stored = True
    try:
        outcome = elastic_object.index(index=index_name, doc_type='danger', body=record)
        print(outcome)
    except Exception as ex:
        print('Error in indexing data')

data = [['Hook', 'James', '90', 'Austin', 'M'],
        ['Sparrow', 'Jack', '15', 'Paris', 'M'],
        ['Kent', 'Clark', '13', 'NYC', 'M'],
        ['Montana', 'Hannah', '28', 'Las Vegas', 'F']]
df = pd.DataFrame(data, columns=['name', 'first_name', 'age', 'city', 'sex'])
result = df.to_json(orient='records')
result = result[1:-1]
es = Elasticsearch()
if es is not None:
    if create_index(es, 'cracra'):
        out = store_record(es, 'cracra', result)
        print('Data indexed successfully')
I got the following error:
POST http://localhost:9200/cracra/danger [status:400 request:0.016s]
Error in indexing data
RequestError(400, 'mapper_parsing_exception', 'failed to parse')
Data indexed successfully
I don't know where it's coming from. If anyone can help me solve this, I would be grateful.
Thanks a lot!

Try to remove the extra commas from your mappings:
"mappings": {
    "danger": {
        "dynamic": "strict",
        "properties": {
            "name": {
                "type": "text"
            },
            "first_name": {
                "type": "text"
            },
            "age": {
                "type": "integer"
            },
            "city": {
                "type": "text"
            },
            "sex": {
                "type": "text", <-- here
            }, <-- and here
        }
    }
}
UPDATE
It seems that the index is created successfully and the problem is in the data indexing. As Nishant Saini noted, you are probably trying to index several documents at a time. This can be done with the Bulk API. Here is an example of a correct request that indexes two documents:
POST cracra/danger/_bulk
{"index": {"_id": 1}}
{"name": "Hook", "first_name": "James", "age": "90", "city": "Austin", "sex": "M"}
{"index": {"_id": 2}}
{"name": "Sparrow", "first_name": "Jack", "age": "15", "city": "Paris", "sex": "M"}
Every document in the request body must appear on its own line, preceded by a line of metadata. In this case the metadata contains only the id that should be assigned to the document.
You can either build this request by hand or use the Elasticsearch Helpers for Python, which can take care of adding the correct metadata.
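If you use the Python helpers, a minimal sketch (assuming the index and mapping from the question; note that document types such as danger are deprecated in Elasticsearch 7+) could look like this:

from elasticsearch import Elasticsearch, helpers
import pandas as pd

data = [['Hook', 'James', '90', 'Austin', 'M'],
        ['Sparrow', 'Jack', '15', 'Paris', 'M']]
df = pd.DataFrame(data, columns=['name', 'first_name', 'age', 'city', 'sex'])

es = Elasticsearch()
actions = (
    {
        "_index": "cracra",
        "_type": "danger",         # only needed on clusters that still use doc types
        "_source": row._asdict(),  # one dict per DataFrame row
    }
    for row in df.itertuples(index=False)
)
helpers.bulk(es, actions)  # sends all documents in a single bulk request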

Related

Python : How to loop through data to access similar keys present inside nested dict

I have an API which, when called, returns a very big JSON response.
I want to access similar keys which are present inside the nested dicts.
I'm using the following lines to make a GET request and store the JSON data:
import requests
import json

p25_st_devices = r'https://url_from_where_im_getting_data.com'
header_events = {
    'Authorization': 'Basic random_keys'}
r2 = requests.get(p25_st_devices, headers=header_events)
r2_json = json.loads(r2.content)
A sample of the JSON is as follows:
{
    "next": "value",
    "self": "value",
    "managedObjects": [
        {
            "creationTime": "2021-08-02T10:48:15.120Z",
            "type": " c8y_MQTTdevice",
            "lastUpdated": "2022-03-24T17:09:01.240+03:00",
            "childAdditions": {
                "self": "value",
                "references": []
            },
            "name": "PS_MQTT1",
            "assetParents": {
                "self": "value",
                "references": []
            },
            "self": "value",
            "id": "338",
            "Building": "value"
        },
        {
            "creationTime": "2021-08-02T13:06:09.834Z",
            "type": " c8y_MQTTdevice",
            "lastUpdated": "2021-12-27T12:08:20.186+03:00",
            "childAdditions": {
                "self": "value",
                "references": []
            },
            "name": "FS_MQTT2",
            "assetParents": {
                "self": "value",
                "references": []
            },
            "self": "value",
            "id": "339",
            "c8y_IsDevice": {}
        },
        {
            "creationTime": "2021-08-02T13:06:39.602Z",
            "type": " c8y_MQTTdevice",
            "lastUpdated": "2021-12-27T12:08:20.433+03:00",
            "childAdditions": {
                "self": "value",
                "references": []
            },
            "name": "PS_MQTT3",
            "assetParents": {
                "self": "value",
                "references": []
            },
            "self": "value",
            "id": "340",
            "c8y_IsDevice": {}
        }
    ],
    "statistics": {
        "totalPages": 423,
        "currentPage": 1,
        "pageSize": 3
    }
}
As per my understanding I can access the name key using r2_json['managedObjects'][0]['name'].
But how do I iterate over this JSON and store all values of name in an array?
EDIT 1:
Another thing I'm trying to achieve is to get all id values from the JSON data and store them in an array, but only where the nested dict in managedObjects has a name starting with PS_.
Therefore, the expected output would be device_id = ['338','340']
You should not just take index [0] of the list; loop over it instead:
all_names = []
for object in r2_json['managedObjects']:
    all_names.append(object['name'])
print(all_names)
edit: Updated answer after OP updated theirs.
For your second question you can use startswith(). The code is almost the same.
PS_names = []
for object in r2_json['managedObjects']:
    if object['name'].startswith("PS_"):
        PS_names.append(object['id'])  # we append the id, if startswith("PS_") returns True
print(PS_names)
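For reference, both results can also be written as list comprehensions over the same data (a compact equivalent of the loops above):

all_names = [obj['name'] for obj in r2_json['managedObjects']]
device_id = [obj['id'] for obj in r2_json['managedObjects']
             if obj['name'].startswith('PS_')]
print(device_id)  # ['338', '340'] for the sample data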

Python Cerberus embed numeric config data in schema

I have a set of documents and schemas I am doing validation against (shocker).
These documents are JSON messages from various different clients that use various different formats, thus a schema is defined for each document/message received from these clients.
I want to use a dispatcher (dictionary with function calls as values) to help perform the mapping/formatting of a document after it is validated against a matching schema.
Once I know the schema a message is valid against, I can then create the desired message payload for my various consumer services by calling the requisite mapping function.
To this end I need a key in my dispatcher which uniquely maps to its respective mapping function for that schema. The key also needs to be used to identify a schema so the correct mapping function can be called.
My question is this: Is there a way to embed a config value like a numeric ID into a schema?
I want to take this schema:
schema = {
    "timestamp": {"type": "number"},
    "values": {
        "type": "list",
        "schema": {
            "type": "dict",
            "schema": {
                "id": {"required": True, "type": "string"},
                "v": {"required": True, "type": "number"},
                "q": {"type": "boolean"},
                "t": {"required": True, "type": "number"},
            },
        },
    },
}
And add a schema_id like this:
schema = {
    "schema_id": 1,
    "timestamp": {"type": "number"},
    "values": {
        "type": "list",
        "schema": {
            "type": "dict",
            "schema": {
                "id": {"required": True, "type": "string"},
                "v": {"required": True, "type": "number"},
                "q": {"type": "boolean"},
                "t": {"required": True, "type": "number"},
            },
        },
    },
}
So after successful validation, a link is created from the message/document to the schema via schema_id, and from there to the matching mapping function in the dispatcher.
Something like this:
mapping_dispatcher = {1: map_function_1, 2: map_function_2...}

if Validator.validate(document, schema) is True:
    id = schema["schema_id"]
    formatted_message = mapping_dispatcher[id](document)
A last-ditch effort could be to simply stringify the JSON schemas and use those as keys, but I'm not sure how I feel about that (it feels clever but wrong)...
I could also be going about this all wrong and there's a smarter way to do it.
Thanks!
small update
I've hacked around it by stringifying the schema, converting to bytes, then hex, then adding the integer values together like so:
import codecs

schema_id = 0
bytes_schema = str(schema).encode()  # stringify the schema first, then convert to bytes
hex_schema = codecs.encode(bytes_schema, "hex")
for char in hex_schema:  # iterating over bytes yields ints in Python 3
    schema_id += int(char)

>>> schema_id
36832
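A more robust variant of this hack would be to hash a canonical serialization of the schema instead of summing hex digits, a sketch using only the standard library (sort_keys keeps the ID stable across dict orderings):

import hashlib
import json

canonical = json.dumps(schema, sort_keys=True)
schema_id = hashlib.sha256(canonical.encode()).hexdigest()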
So instead of a hash function, I just embedded the schema in another JSON object that holds the info, like so:
[
    {
        "schema_id": "3",
        "schema": {
            "deviceName": {
                "type": "string"
            },
            "tagName": {
                "required": true,
                "type": "string"
            },
            "deviceID": {
                "type": "string"
            },
            "success": {
                "type": "boolean"
            },
            "datatype": {
                "type": "string"
            },
            "timestamp": {
                "required": true,
                "type": "number"
            },
            "value": {
                "required": true,
                "type": "number"
            },
            "registerId": {
                "type": "string"
            },
            "description": {
                "type": "string"
            }
        }
    }
]
Was overthinking it I guess.
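For completeness, here is a sketch (hypothetical names, assuming the cerberus package) of how these wrapper objects could drive the dispatcher; the wrapper is needed because Cerberus expects every top-level schema key to map to a dict of rules, so a bare schema_id key inside the schema itself would be rejected:

from cerberus import Validator

def dispatch(document, schema_wrappers, mapping_dispatcher):
    # schema_wrappers: the list of {"schema_id": ..., "schema": ...} objects above
    for wrapper in schema_wrappers:
        if Validator(wrapper["schema"]).validate(document):
            return mapping_dispatcher[wrapper["schema_id"]](document)
    return None  # no schema matched this document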

How to add data to a topic using AvroProducer

I have a topic with the following schema. Could someone help me out with how to add data to the different fields?
{
    "name": "Project",
    "type": "record",
    "namespace": "abcdefg",
    "fields": [
        {
            "name": "Object",
            "type": {
                "name": "Object",
                "type": "record",
                "fields": [
                    {
                        "name": "Number_ID",
                        "type": "int"
                    },
                    {
                        "name": "Accept",
                        "type": "boolean"
                    }
                ]
            }
        },
        {
            "name": "DataStructureType",
            "type": "string"
        },
        {
            "name": "ProjectID",
            "type": "string"
        }
    ]
}
I tried the following code. I get errors like "list is not iterable" or "list index out of range".
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

AvroProducerConf = {'bootstrap.servers': 'localhost:9092',
                    'schema.registry.url': 'http://localhost:8081'}
value_schema = avro.load('project.avsc')
avroProducer = AvroProducer(AvroProducerConf, default_value_schema=value_schema)

while True:
    avroProducer.produce(topic='my_topic',
                         value={['Object'][0]: "value", ['Object'][1]: "true",
                                ['DataStructureType']: "testvalue", ['ProjectID']: "123"})
    avroProducer.flush()
It's not clear what you're expecting something like this to do... ['Object'][0] just indexes into a literal list, and the keys of a dict cannot be lists anyway.
Try sending this, which matches your Avro schema:
value = {
    'Object': {
        "Number_ID": 1,
        "Accept": True
    },
    'DataStructureType': "testvalue",
    'ProjectID': "123"
}
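With that payload, the producer from the question should work unchanged (a sketch reusing the avroProducer set up above):

avroProducer.produce(topic='my_topic', value=value)
avroProducer.flush()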

Validate JSON in Python 3.6 where number of inner blob is variable?

I have a JSON document which I need to validate for the right format before storing it. I saw the jsonschema package and related questions on SO, but in my case there can be any number of inner items in the list 'Data', which is also the root. What should my code look like? The code block below gives a 'pass' to wrong data (improper structure) too:
json_document = '''{
    "Data": [
        {
            "vol_no": "001",
            "loc": "2341",
            "ts": "2016-02-04 14:25:19.000000"
        },
        {
            "vol_no": "023",
            "loc": "4635",
            "ts": "2016-02-02 01:14:38.000000"
        }
    ]
}'''
schema = {
    "type": "object",
    "properties": {
        "vol_no": {"type": "string"},
        "loc": {"type": "number"},
        "ts": {"type": "string"}
    },
}
for idx, item in enumerate(json.loads(json_document)['Data']):
    try:
        print(item)
        print(schema)
        validate(item, schema)
        print("Record #{}: OK\n".format(idx))
    except jsonschema.exceptions.ValidationError as ve:
        print("Record #{}: ERROR\n".format(idx))
        print(str(ve) + "\n")
Here is an example of an improper structure. I changed the key name in the first element and removed ts from the second element of the array, but it doesn't error out either:
json_document = '''{
    "Data": [
        {
            "abcdsddfwq": "001",
            "loc": "2341",
            "ts": "2016-02-04 14:25:19.000000"
        },
        {
            "vol_no": "023",
            "loc": "4635"
        }
    ]
}'''
You can specify that Data must be an array where all items are objects with some required properties:
{
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "vol_no": {"type": "string"},
            "loc": {"type": "string"},
            "ts": {"type": "string"}
        },
        "required": ["vol_no", "loc", "ts"]
    }
}
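To validate the whole document in one call, the array schema can be nested under the root object's Data property (a sketch using the jsonschema package and the json_document from the question):

import json
import jsonschema

document_schema = {
    "type": "object",
    "properties": {
        "Data": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "vol_no": {"type": "string"},
                    "loc": {"type": "string"},
                    "ts": {"type": "string"}
                },
                "required": ["vol_no", "loc", "ts"]
            }
        }
    },
    "required": ["Data"]
}

# Raises jsonschema.exceptions.ValidationError on the improper example above:
# the first item is missing "vol_no" and the second is missing "ts".
jsonschema.validate(json.loads(json_document), document_schema)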

Accessing nested json objects using python

I am trying to interact with an API and running into issues accessing nested objects. Below is a sample of the JSON output I am working with.
{
    "results": [
        {
            "task_id": "22774853-2b2c-49f4-b044-2d053141b635",
            "params": {
                "type": "host",
                "target": "54.243.80.16",
                "source": "malware_analysis"
            },
            "v": "2.0.2",
            "status": "success",
            "time": 227,
            "data": {
                "details": {
                    "as_owner": "Amazon.com, Inc.",
                    "asn": "14618",
                    "country": "US",
                    "detected_urls": [],
                    "resolutions": [
                        {
                            "hostname": "bumbleride.com",
                            "last_resolved": "2016-09-15 00:00:00"
                        },
                        {
                            "hostname": "chilitechnology.com",
                            "last_resolved": "2016-09-16 00:00:00"
                        }
                    ],
                    "response_code": 1,
                    "verbose_msg": "IP address in dataset"
                },
                "match": true
            }
        }
    ]
}
The deepest I am able to access is the data portion, which returns too much... ideally I am just trying to access as_owner, asn, country, detected_urls, and resolutions.
When I try to access details / response_code etc. I get a KeyError. My nested JSON goes deeper than in the other questions mentioned, and I have tried their logic.
Below is my current code snippet; any help is appreciated!
import requests
import json

headers = {
    'Content-Type': 'application/json',
}
params = (
    ('wait', 'true'),
)
data = '{"target":{"one":{"type": "ip","target": "54.243.80.16", "sources": ["xxx","xxxxx"]}}}'

r = requests.post('https://fakewebsite:8000/api/services/intel/lookup/jobs', headers=headers, params=params, data=data, auth=('apikey', ''))
parsed_json = json.loads(r.text)
#results = parsed_json["results"]
for item in parsed_json["results"]:
    print(item['data'])
You just need to index correctly into the converted JSON. Then you can easily loop over a list of the keys you want to fetch, since they are all in the "details" dictionary.
import json

raw = '''\
{
    "results": [
        {
            "task_id": "22774853-2b2c-49f4-b044-2d053141b635",
            "params": {
                "type": "host",
                "target": "54.243.80.16",
                "source": "malware_analysis"
            },
            "v": "2.0.2",
            "status": "success",
            "time": 227,
            "data": {
                "details": {
                    "as_owner": "Amazon.com, Inc.",
                    "asn": "14618",
                    "country": "US",
                    "detected_urls": [],
                    "resolutions": [
                        {
                            "hostname": "bumbleride.com",
                            "last_resolved": "2016-09-15 00:00:00"
                        },
                        {
                            "hostname": "chilitechnology.com",
                            "last_resolved": "2016-09-16 00:00:00"
                        }
                    ],
                    "response_code": 1,
                    "verbose_msg": "IP address in dataset"
                },
                "match": true
            }
        }
    ]
}
'''

parsed_json = json.loads(raw)
wanted = ['as_owner', 'asn', 'country', 'detected_urls', 'resolutions']

for item in parsed_json["results"]:
    details = item['data']['details']
    for key in wanted:
        print(key, ':', json.dumps(details[key], indent=4))
    # Put a blank line at the end of the details for each item
    print()
output
as_owner : "Amazon.com, Inc."
asn : "14618"
country : "US"
detected_urls : []
resolutions : [
    {
        "hostname": "bumbleride.com",
        "last_resolved": "2016-09-15 00:00:00"
    },
    {
        "hostname": "chilitechnology.com",
        "last_resolved": "2016-09-16 00:00:00"
    }
]
BTW, when you fetch JSON data using requests there's no need to use json.loads: you can access the converted JSON by calling the .json() method of the returned response object instead of using its .text attribute.
Here's a more robust version of the main loop of the above code. It simply ignores any missing keys. I didn't post this code earlier because the extra if tests make it slightly less efficient, and I didn't know that keys could be missing.
for item in parsed_json["results"]:
    if 'data' not in item:
        continue
    data = item['data']
    if 'details' not in data:
        continue
    details = data['details']
    for key in wanted:
        if key in details:
            print(key, ':', json.dumps(details[key], indent=4))
    # Put a blank line at the end of the details for each item
    print()
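An equivalent, slightly more compact variant uses dict.get with empty-dict defaults, so missing levels simply yield nothing to print (a sketch of the same loop):

for item in parsed_json["results"]:
    details = item.get('data', {}).get('details', {})  # {} when either level is missing
    for key in wanted:
        if key in details:
            print(key, ':', json.dumps(details[key], indent=4))
    print()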
