How to search in n elements with elasticsearch-dsl - Python

I have a problem in Elasticsearch: I need to know whether an element already exists. For a number of practical reasons I store image metadata in Elasticsearch for each user (the author), and to prevent certain attacks I need to be sure that an image hasn't already been saved. Here is a standard JSON file:
{
  "user_group": "user",
  "user_data": {
    "name": "myname"
  },
  "user_image": [
    {
      "size": "1920x1080",
      "location": "my_city",
      "description": "my_desc",
      "name": "myname",
      "md5_checksum_image": "a_md5"
    },
    {
      "size": "1920x1080",
      "location": "my_city",
      "description": "my_desc",
      "name": "myothername",
      "md5_checksum_image": "a_md5"
    }
  ]
}
My problem is that I could have thousands of images, so how could I test with a query whether I've already registered a given image by looking at its checksum?
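A minimal sketch of one way to check this with elasticsearch-dsl, assuming an index named "users", a nested mapping for user_image, and a keyword mapping for md5_checksum_image (all three are assumptions, not given in the question):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

client = Elasticsearch()

def image_already_registered(md5):
    # Match the checksum inside the nested user_image objects.
    s = Search(using=client, index="users").query(
        "nested",
        path="user_image",
        query=Q("term", user_image__md5_checksum_image=md5),
    )
    # count() avoids fetching documents; any hit means a duplicate.
    return s.count() > 0

if image_already_registered("a_md5"):
    print("image already saved - reject the upload")

If user_image is mapped as a plain object field instead of nested, the same term query works without the nested wrapper.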

Related

How to load nested JSON data into BigQuery

I'm trying to load JSON data from an API into a BigQuery table on GCP; however, the JSON data seems to be missing a square bracket, so I get the error '"Repeated record with name trip_update added outside of an array."}]'. I don't know how to fix this.
Here is the data sample:
{
  "header": {
    "gtfs_realtime_version": "1.0",
    "timestamp": 1607630971
  },
  "entity": [
    {
      "id": "65.5.17-120-cm1-1.18.O",
      "trip_update": {
        "trip": {
          "trip_id": "65.5.17-120-cm1-1.18.O",
          "start_time": "18:00:00",
          "start_date": "20201210",
          "schedule_relationship": "SCHEDULED",
          "route_id": "17-120-cm1-1"
        },
        "stop_time_update": [
          {
            "stop_sequence": 1,
            "departure": {
              "delay": 0
            },
            "stop_id": "8220B1351201",
            "schedule_relationship": "SCHEDULED"
          },
          {
            "stop_sequence": 23,
            "arrival": {
              "delay": 2340
            },
            "departure": {
              "delay": 2340
            },
            "stop_id": "8260B1025301",
            "schedule_relationship": "SCHEDULED"
          }
        ]
      }
    }
  ]
}
Here is a schema and code:
Schema:
[
  {
    "name": "header",
    "type": "record",
    "fields": [
      {
        "name": "gtfs_realtime_version",
        "type": "string",
        "description": "version of speed specification"
      },
      {
        "name": "timestamp",
        "type": "integer",
        "description": "The moment where this dataset was generated on server e.g. 1593102976"
      }
    ]
  },
  {
    "name": "entity",
    "type": "record",
    "mode": "REPEATED",
    "description": "Multiple entities can be included in the feed",
    "fields": [
      {
        "name": "id",
        "type": "string",
        "description": "unique identifier for the entity"
      },
      {
        "name": "trip_update",
        "type": "struct",
        "mode": "REPEATED",
        "description": "Data about the realtime departure delays of a trip. At least one of the fields trip_update, vehicle, or alert must be provided - all these fields cannot be empty.",
        "fields": [
          {
            "name": "trip",
            "type": "record",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "trip_id",
                "type": "string",
                "description": "selects which GTFS entity (trip) will be affected"
              },
              {
                "name": "start_time",
                "type": "string",
                "description": "The initially scheduled start time of this trip instance e.g. 13:30:00"
              },
              {
                "name": "start_date",
                "type": "string",
                "description": "The start date of this trip instance in YYYYMMDD format. Whether start_date is required depends on the type of trip: e.g. 20200625"
              },
              {
                "name": "schedule_relationship",
                "type": "string",
                "description": "The relation between this trip and the static schedule e.g. SCHEDULED"
              },
              {
                "name": "route_id",
                "type": "string",
                "description": "The route_id from the GTFS feed that this selector refers to e.g. 10-263-e16-1"
              }
            ]
          }
        ]
      },
      {
        "name": "stop_time_update",
        "type": "record",
        "mode": "REPEATED",
        "description": "Updates to StopTimes for the trip (both future, i.e., predictions, and in some cases, past ones, i.e., those that already happened). The updates must be sorted by stop_sequence, and apply for all the following stops of the trip up to the next specified stop_time_update. At least one stop_time_update must be provided for the trip unless the trip.schedule_relationship is CANCELED - if the trip is canceled, no stop_time_updates need to be provided.",
        "fields": [
          {
            "name": "stop_sequence",
            "type": "string",
            "description": "Must be the same as in stop_times.txt in the corresponding GTFS feed e.g. 3"
          },
          {
            "name": "arrival",
            "type": "record",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "delay",
                "type": "string",
                "description": "Delay (in seconds) can be positive (meaning that the vehicle is late) or negative (meaning that the vehicle is ahead of schedule). Delay of 0 means that the vehicle is exactly on time e.g. 5"
              }
            ]
          },
          {
            "name": "departure",
            "type": "record",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "delay",
                "type": "integer"
              }
            ]
          },
          {
            "name": "stop_id",
            "type": "string",
            "description": "Must be the same as in stops.txt in the corresponding GTFS feed e.g. 8430B2552301"
          },
          {
            "name": "schedule_relationship",
            "type": "string",
            "description": "The relation between this StopTime and the static schedule e.g. SCHEDULED, SKIPPED or NO_DATA"
          }
        ]
      }
    ]
  }
]
Function (following the Google guideline https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions?authuser=2#before-you-begin):
def _insert_into_bigquery(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    row = json.loads(blob.download_as_string())
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    errors = BQ.insert_rows_json(table,
                                 json_rows=row,
                                 ignore_unknown_values=True,
                                 retry=retry.Retry(deadline=30))
    if errors != []:
        raise BigQueryError(errors)
Your schema definition is wrong: trip_update isn't a repeated struct but a nullable record (or required, but in any case not repeated):
{"name": "trip_update",
"type": "record",
"mode": "NULLABLE",
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
I believe that your "trip_update" and "trip" fields must contain an array of values (indicated by square brackets), the same as you did for "stop_time_update".
"trip_update": [
{
"trip": [
{
"trip_id
I am not sure that will be enough to load your data flawlessly, though.
Your example row has many newline characters in the middle of the JSON, and when you load data from JSON files, the rows must be newline delimited: BigQuery expects newline-delimited JSON files to contain a single record per line, so the parser tries to interpret each line as a separate JSON row (Reference).
Here is an example of how your JSON data file should look (one complete record per line):
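For illustration (objects shortened, and the second record invented just to show the one-record-per-line shape):

{"header": {"gtfs_realtime_version": "1.0", "timestamp": 1607630971}, "entity": [{"id": "65.5.17-120-cm1-1.18.O", "trip_update": {"trip": {"trip_id": "65.5.17-120-cm1-1.18.O"}}}]}
{"header": {"gtfs_realtime_version": "1.0", "timestamp": 1607630999}, "entity": [{"id": "66.1.17-121-cm1-1.19.O", "trip_update": {"trip": {"trip_id": "66.1.17-121-cm1-1.19.O"}}}]}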

JSON Extraction in Python

I am trying to extract a specific part of the JSON but I keep on getting errors.
I am interested in the following sections:
"field": "tag",
"value": "Wian",
I can extract the entire filter section using:
for i in range(0, values_num):
    dedata[i]['filter']
But if I try to filter beyond that point I just get errors.
Could someone please assist me with this?
Here is the JSON output style:
{
  "mod_time": 1594631137499,
  "description": "",
  "id": 82,
  "name": "Wian",
  "include_custom_devices": true,
  "dynamic": true,
  "field": null,
  "value": null,
  "filter": {
    "rules": [
      {
        "field": "tag",
        "operand": {
          "value": "Wian",
          "is_regex": false
        },
        "operator": "~"
      }
    ],
    "operator": "and"
  }
}
You are probably trying to access the data in rules, but since it's an array, you have to access it specifically by getting the [0] index.
You could simply use .get('<name>') as shown below:
dedata[i]['filter']['rules'][0].get('field')
Likewise for value:
dedata[i]['filter']['rules'][0]['operand'].get('value')
Comment out the for loop and try it without the loop and the [i] index, and see if it works.
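Putting those pieces together, a small runnable sketch (the surrounding list and the variable name dedata are taken from the question; the sample is trimmed to the relevant keys):

import json

raw = '''[{"name": "Wian",
           "filter": {"rules": [{"field": "tag",
                                 "operand": {"value": "Wian", "is_regex": false},
                                 "operator": "~"}],
                      "operator": "and"}}]'''
dedata = json.loads(raw)

for entry in dedata:
    rule = entry['filter']['rules'][0]   # rules is an array: take the first rule
    print(rule.get('field'))             # -> tag
    print(rule['operand'].get('value'))  # -> Wian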

I need help figuring out how to turn online data into a usable list that I can print data from

In a program I am working on, I use ACRCloud's music fingerprinting service. After uploading the data I need identified, I get back this piece of data:
re = ACRCloudRecognizer(config)
data = (re.recognize_by_file('audio_name.mp3', 0))
>>>data
'{"metadata":{"timestamp_utc":"2020-05-18 23:00:59","music":[{"label":"NoCopyrightSounds","play_offset_ms":125620,"duration_ms":326609,"external_ids":{},"artists":[{"name":"Culture Code & Regoton"}],"result_from":1,"acrid":"a53ea40c6a8b4a6795ac3d799f6a4aec","title":"Waking Up","genres":[{"name":"Electro"}],"album":{"name":"Waking Up"},"score":100,"external_metadata":{},"release_date":"2014-05-25"}]},"cost_time":5.5099999904633,"status":{"msg":"Success","version":"1.0","code":0},"result_type":0}\n'
I think it's a list, but I am unable to figure out how to navigate it or grab specific information from it. I'm unsure how the information is structured and what patterns to look for. Ideally, I would like to write a print function that prints the title, artists, and album.
Any help is much appreciated!
Formatting the JSON makes it more legible
{
  "metadata": {
    "timestamp_utc": "2020-05-18 23:00:59",
    "music": [
      {
        "label": "NoCopyrightSounds",
        "play_offset_ms": 125620,
        "duration_ms": 326609,
        "external_ids": {},
        "artists": [
          {
            "name": "Culture Code & Regoton"
          }
        ],
        "result_from": 1,
        "acrid": "a53ea40c6a8b4a6795ac3d799f6a4aec",
        "title": "Waking Up",
        "genres": [
          {
            "name": "Electro"
          }
        ],
        "album": {
          "name": "Waking Up"
        },
        "score": 100,
        "external_metadata": {},
        "release_date": "2014-05-25"
      }
    ]
  },
  "cost_time": 5.5099999904633,
  "status": {
    "msg": "Success",
    "version": "1.0",
    "code": 0
  },
  "result_type": 0
}
Looks like you're looking for .metadata.music[0].title (presumably), but only if .status.code is 0. Note that music is an array, so you need an index or a loop to reach each entry.
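As a hedged sketch of that navigation in Python: the value returned is a JSON string, so it has to be parsed first, and metadata.music is a list, hence the loop:

import json

result = json.loads(data)  # data: the string returned by recognize_by_file

if result['status']['code'] == 0:  # 0 means the lookup succeeded
    for track in result['metadata']['music']:
        artists = ', '.join(a['name'] for a in track['artists'])
        print('Title:  ', track['title'])
        print('Artists:', artists)
        print('Album:  ', track['album']['name'])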

How to insert large documents in mongodb

I want to store a large JSON document, larger than 16 MB (the size limit per document), in MongoDB, but due to the size limit I am unable to do so. How can I store such large documents in MongoDB? I know the GridFS API can be an option, but after a lot of struggle I am still unable to figure out how to use GridFS and what the right commands are to insert and retrieve data with it. Any help in using GridFS, or any other alternative for storing large JSON documents, would be much appreciated.
I am using Python's PyMongo package.
Thanks!
There are many methods to store large data in MongoDB. For example, a plain insert looks like this:
var Data = {
  "userID": "1",
  "userData": {
    "firstName": "Test First Name",
    "lastName": "Test Last Name",
    "number": {
      "phNumber": "9999999991",
      "cellNumber": "8888888888"
    },
    "address": {
      "Geo": {
        "latitude": 15.40,
        "longitude": -70.90
      },
      "city": "surat",
      "state": "gujarat",
      "country": "india"
    },
    "product": {
      "game": {
        "GTA": true,
        "DOTA": true
      },
      "television": {
        "TV": true,
        "PlayStation": true,
        "Xbox": false
      }
    }
  },
  "key": "ANbcsgYSIDncsSK"
};
db.collection("[collection Name]").insertOne(Data);
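Since the question specifically asks about GridFS, here is a minimal PyMongo sketch (the database name and filename are placeholders). GridFS stores the document as an opaque blob split into chunks, so you can no longer query fields inside it, but the 16 MB document limit no longer applies:

import json
import gridfs
from pymongo import MongoClient

client = MongoClient()
db = client["mydb"]   # placeholder database name
fs = gridfs.GridFS(db)

# Store: serialize the JSON document to bytes; GridFS splits it into chunks.
large_doc = {"userID": "1", "userData": {"firstName": "Test First Name"}}
file_id = fs.put(json.dumps(large_doc).encode("utf-8"),
                 filename="large_doc.json")

# Retrieve: read the bytes back and parse them into a dict again.
restored = json.loads(fs.get(file_id).read().decode("utf-8"))
assert restored == large_doc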

Python: querying JSON with objectpath

I have a nested JSON structure and I'm using objectpath (the Python API version), but I don't understand how to select and filter some information (more precisely, the nested information in the structure).
E.g. I want to select the "description" of the action "reading" for the user "John".
JSON:
{
  "user":
  {
    "actions":
    [
      {
        "name": "reading",
        "description": "blablabla"
      }
    ]
    "name": "John"
  }
}
CODE:
$.user[#.name is 'John' and #.actions.name is 'reading'].actions.description
but it doesn't work (it returns an empty set, although the data is present in my JSON).
Any suggestion?
Is this what you are trying to do?
import objectpath

data = {
    "user": {
        "actions": {
            "name": "reading",
            "description": "blablabla"
        },
        "name": "John"
    }
}

tree = objectpath.Tree(data)
result = tree.execute("$.user[#.name is 'John'].actions[#.name is 'reading'].description")
for entry in result:
    print(entry)
Output
blablabla
I had to fix your JSON. Also, tree.execute returns a generator. You could replace the for loop with print(next(result)), but the for loop seemed clearer.
from objectpath import *

your_json = {"name": "felix", "last_name": "diaz"}

# This JSON path will bring back all the key-values of your JSON
your_json_path = '$.*'
my_key_values = Tree(your_json).execute(your_json_path)

# If you want to retrieve the name node... then specify it.
my_name = Tree(your_json).execute('$.name')

# If you want to retrieve the last_name node... then specify it.
last_name = Tree(your_json).execute('$.last_name')
I believe you're just missing a comma in JSON:
{
  "user":
  {
    "actions": [
      {
        "name": "reading",
        "description": "blablabla"
      }
    ],
    "name": "John"
  }
}
Assuming there is only one "John", with only one "reading" activity, the following query works:
$.user[#.name is 'John'].actions[0][#.name is 'reading'][0].description
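For completeness, a sketch of executing that query against the fixed data (execute may hand back either the value itself or a generator over matches, so the result is normalized below):

import objectpath

fixed_data = {
    "user": {
        "actions": [
            {"name": "reading", "description": "blablabla"}
        ],
        "name": "John"
    }
}

tree = objectpath.Tree(fixed_data)
result = tree.execute(
    "$.user[#.name is 'John'].actions[0][#.name is 'reading'][0].description"
)
# Normalize: unwrap a generator if one is returned, else print the scalar.
print(list(result) if hasattr(result, "__next__") else result)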
If there could be multiple "John"s, with multiple "reading" activities, the following query will almost work:
$.user.*[#.name is 'John'].actions..*[#.name is 'reading'].description
I say almost because the use of .. will be problematic if there are other nested dictionaries with "name" and "description" entries, such as
{
  "user": {
    "actions": [
      {
        "name": "reading",
        "description": "blablabla",
        "nested": {
          "name": "reading",
          "description": "broken"
        }
      }
    ],
    "name": "John"
  }
}
For a fully correct query, there is an open issue about correctly implementing queries into arrays: https://github.com/adriank/ObjectPath/issues/60
