I have the output of an elasticsearch query saved in a file. The first few lines look like this:
{"took": 1,
"timed_out": false,
"_shards": {},
"hits": {
"total": 27,
"max_score": 6.5157733,
"hits": [
{
"_index": "dbgap_062617",
"_type": "dataset",
***"_id": "595189d15152c64c3b0adf16"***,
"_score": 6.5157733,
"_source": {
"dataAcquisition": {
"performedBy": "\n\t\tT\n\t\t"
},
"provenance": {
"ingestTime": "201",
},
"studyGroup": [
{
"Identifier": "1",
"name": "Diseas"
}
],
"license": {
"downloadURL": "http",
},
"study": {
"alternateIdentifiers": "yes",
},
"disease": {
"name": [
"Coronary Artery Disease"
]
},
"NLP_Fields": {
"CellLine": [],
"MeshID": [
"C0066533",
],
"DiseaseID": [
"C0010068"
],
"ChemicalID": [],
"Disease": [
"coronary artery disease"
],
"Chemical": [],
"Meshterm": [
"migen",
]
},
"datasetDistributions": [
{
"dateReleased": "20150312",
}
],
"dataset": {
"citations": [
"20032323"
],
**"description": "The Precoc.",**
**"title": "MIGen_ExS: PROCARDIS"**
},
.... and the list goes on with a bunch of other items ....
From all of these nodes I was interested in the unique _ids, title, and description, so I created a dictionary and extracted the parts I was interested in using the json module. Here is my code:
import json

s = {}
d = open('local file', 'w')
with open('localfile', 'r') as ready:
    for line in ready:
        test = json.loads(line, encoding='utf-8')
        for i in test['hits']['hits']:
            for x in i:
                s.setdefault(i['_id'], [i['_source']['dataset']['description'],
                                        i['_source']['dataset']['title']])
for k, v in s.items():
    d.write(k + '\t' + v[0] + '\t' + v[1] + '\n')
d.close()
Now, when I run it, it gives me a file with duplicated _ids! Isn't a dictionary supposed to give me unique _ids? In my original output file I have lots of duplicated ids that I wanted to get rid of.
Also, I ran set() on just the _ids to count the unique ones, and it came to 138. But with the dictionary, if I remove the generated duplicate ids, it comes down to 17!
Can someone please tell me why this is happening?
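For reference, dictionary keys are unique by construction, and setdefault only stores a value the first time a key is seen; a minimal demonstration:

s = {}
s.setdefault('a', ['first'])
s.setdefault('a', ['second'])  # 'a' is already present, so this call changes nothing
print(s)  # {'a': ['first']}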
If you want a unique ID and you're using a database, it will create one for you; if not, you'll need to generate a unique number or string yourself. Depending on how the dictionaries are created, you could use the timestamp of when the dictionary was created, or you could use uuid.uuid4(). For more info on uuid, here are the docs.
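For instance, a minimal sketch with the standard library's uuid module:

import uuid

unique_id = str(uuid.uuid4())  # random 128-bit UUID, e.g. '1b9d6bcd-...'
print(unique_id)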
I have an example json file. I need to extract all the values of the downloadUrl keys:
{
  "nodes": {
    "children": [
      {
        "id": "",
        "localizedName": "",
        "name": "Documents",
        "children": [
          {
            "id": "",
            "localizedName": "Brochures",
            "name": "Brochures",
            "items": [
              {
                "title": "Brochure",
                "downloadUrl": "/documents/brochure-en.pdf",
                "fileType": "pdf",
                "fileSize": "2.9 MB"
              }
            ]
          },
          {
            "id": "192194",
            "localizedName": "",
            "name": "Demonstrations",
            "items": [
              {
                "title": "Safety Poster",
                "downloadUrl": "safety-poster-en.pdf",
                "fileType": "pdf",
                "fileSize": "1.1 MB"
              }
            ]
          }
        ]
      }
    ]
  }
}
I'm trying to do this with this query:
jmespath.search('nodes[*].downloadUrl', file)
but the list of values is not displayed.
Where is the error?
Statically, your property is under:

nodes
  children
    [ ]
      children
        [ ]
          items
            [ ]
              downloadUrl
So a query giving you those values would be:
nodes.children[].children[].items[].downloadUrl
If you want something a little more dynamic (let's say the property names can change, but the level at which you will find downloadUrl won't), you could use this query:
*.*[][].*[][].*[?downloadUrl][][].downloadUrl
But sadly, querying an arbitrary structure, as you can do in jq, is not something JMESPath supports at the moment.
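For reference, a quick check of the static query, assuming the example document above has been parsed into file with json.load:

import jmespath

urls = jmespath.search('nodes.children[].children[].items[].downloadUrl', file)
print(urls)  # ['/documents/brochure-en.pdf', 'safety-poster-en.pdf']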
You need to do something like:
jmespath.search("nodes.children[*].children[*].items[*].downloadUrl", file)
I have a multilevel/complex json file, twitter.json, and I want to extract ONLY the author IDs from it.
This is how my file 'twitter.json' looks:
[
  [
    {
      "tweets_results": [
        {
          "meta": {
            "result_count": 0
          }
        }
      ],
      "youtube_link": "www.youtube.com/channel/UCl4GlGXR0ED6AUJU1kRhRzQ"
    }
  ],
  [
    {
      "tweets_results": [
        {
          "data": [
            {
              "author_id": "125959599",
              "created_at": "2021-06-12T15:16:40.000Z",
              "id": "1403732993269649410",
              "in_reply_to_user_id": "125959599",
              "lang": "pt",
              "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 1,
                "retweet_count": 0
              },
              "source": "Twitter for Android",
              "text": "⌨️ Canais do YouTube:\n\n1 - Alexandre Garcia: Canal de Brasília"
            },
            {
              "author_id": "521827796",
              "created_at": "2021-06-07T20:23:08.000Z",
              "id": "1401998177943834626",
              "in_reply_to_user_id": "623794755",
              "lang": "und",
              "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
              },
              "source": "TweetDeck",
              "text": "#thelittlecouto"
            }
          ],
          "meta": {
            "newest_id": "1426546114115870722",
            "oldest_id": "1367808835403063298",
            "result_count": 7
          }
        }
      ],
      "youtube_link": "www.youtube.com/channel/UCm0yTweyAa0PwEIp0l3N_gA"
    }
  ]
]
I have read through many similar SO questions (including but not limited to):
Access the key of a multilevel JSON file in python
Multilevel JSON Dictionary - can't extract key into new dictionary
How to extract a single value from JSON response?
how to get fields and values from a specific key of json file in python
How to select specific key/value of an object in json via python
Python: Getting all values of a specific key from json
But the structures of those jsons are pretty simple, and when I try to replicate that approach, I hit errors.
From what I read, contents.tweets_results.data.author_id is how the reference would go, and I am loading with contents = json.load(open("twitter.json")). Any help is appreciated.
EDIT: Both @sammywemmy's and @balderman's code worked for me. I accepted @sammywemmy's because I used that code, but I wanted to credit them both in some way.
Your data has a path to it: you've got a list nested in a list; within the inner list you have a tweets_results key, whose value is a list of dicts; one of them has a data key, which contains a list, which contains dictionaries where one of the keys is author_id. We can simulate the path (sort of) as '[][].tweets_results[].data[].author_id'.
A rehash, sort of: hit the first list, then the inner list, then access the tweets_results key, then its list of values; within that list of values, access the data key, and within the list of values associated with data, access the author_id:
With this path, one can use jmespath to pull out the author_ids :
# pip install jmespath
import jmespath
# similar to re.compile
expression = jmespath.compile('[][].tweets_results[].data[].author_id')
expression.search(data)
['125959599', '521827796']
jmespath is quite useful if you want to build a data structure from nested dicts; if however, you are only concerned with the values for author_id, you can use nested_lookup instead; it recursively searches for the keys and returns the values:
# pip install nested-lookup
from nested_lookup import nested_lookup
nested_lookup('author_id', data)
['125959599', '521827796']
See below (no external lib is involved)
data = [
    [
        {
            "tweets_results": [
                {
                    "meta": {
                        "result_count": 0
                    }
                }
            ],
            "youtube_link": "www.youtube.com/channel/UCl4GlGXR0ED6AUJU1kRhRzQ"
        }
    ],
    [
        {
            "tweets_results": [
                {
                    "data": [
                        {
                            "author_id": "125959599",
                            "created_at": "2021-06-12T15:16:40.000Z",
                            "id": "1403732993269649410",
                            "in_reply_to_user_id": "125959599",
                            "lang": "pt",
                            "public_metrics": {
                                "like_count": 0,
                                "quote_count": 0,
                                "reply_count": 1,
                                "retweet_count": 0
                            },
                            "source": "Twitter for Android",
                            "text": "⌨️ Canais do YouTube:\n\n1 - Alexandre Garcia: Canal de Brasília"
                        },
                        {
                            "author_id": "521827796",
                            "created_at": "2021-06-07T20:23:08.000Z",
                            "id": "1401998177943834626",
                            "in_reply_to_user_id": "623794755",
                            "lang": "und",
                            "public_metrics": {
                                "like_count": 0,
                                "quote_count": 0,
                                "reply_count": 0,
                                "retweet_count": 0
                            },
                            "source": "TweetDeck",
                            "text": "#thelittlecouto"
                        }
                    ],
                    "meta": {
                        "newest_id": "1426546114115870722",
                        "oldest_id": "1367808835403063298",
                        "result_count": 7
                    }
                }
            ],
            "youtube_link": "www.youtube.com/channel/UCm0yTweyAa0PwEIp0l3N_gA"
        }
    ]
]

ids = []
for entry in data:
    for sub in entry:
        result = sub['tweets_results']
        if result[0].get('data'):
            info = result[0]['data']
            for item in info:
                ids.append(item.get('author_id', 'not_found'))
print(ids)
output
['125959599', '521827796']
I have this DEVICE collection
[
  {
    "_id": ObjectId("60265a12f9bf1e3974dabe56"),
    "Name": "Device",
    "Configuration_ids": [
      ObjectId("60265a11f9bf1e3974dabe54"),
      ObjectId("60265a11f9bf1e3974dabe55")
    ]
  },
  {
    "_id": ObjectId("60265a92f9bf1e3974dabe64"),
    "Name": "Device2",
    "Configuration_ids": [
      ObjectId("60265a92f9bf1e3974dabe5a"),
      ObjectId("60265a92f9bf1e3974dabe5b")
    ]
  },
  {
    "_id": ObjectId("60265a92f9bf1e3974dabe65"),
    "Name": "Device3",
    "Configuration_ids": [
      ObjectId("60265a92f9bf1e3974dabe5e"),
      ObjectId("60265a92f9bf1e3974dabe5f")
    ]
  }
]
I need to update all the documents that match a list of device ids and push the corresponding element of a given configuration_ids list into each matched device. The two lists have the same length.
My solution is below, but can I do it in one single query?
device_ids = [
    ObjectId("60265a12f9bf1e3974dabe56"),
    ObjectId("60265a92f9bf1e3974dabe64"),
    ObjectId("60265a92f9bf1e3974dabe65")
]
configuration_ids = [
    ObjectId("60267d14bc2f40d0dec1de3b"),
    ObjectId("60267d14bc2f40d0dec1de3c"),
    ObjectId("60267d14bc2f40d0dec1de3d")
]

for i in range(0, len(device_ids)):
    update_devices = device_collection.update_one(
        {'_id': ObjectId(device_ids[i])},
        {'$push': {'Configuration_ids': configuration_ids[i]}}
    )
The result:
[
  {
    "_id": ObjectId("60265a12f9bf1e3974dabe56"),
    "Name": "Device",
    "Configuration_ids": [
      ObjectId("60265a11f9bf1e3974dabe54"),
      ObjectId("60265a11f9bf1e3974dabe55"),
      ObjectId("60267d14bc2f40d0dec1de3b")
    ]
  },
  {
    "_id": ObjectId("60265a92f9bf1e3974dabe64"),
    "Name": "Device2",
    "Configuration_ids": [
      ObjectId("60265a92f9bf1e3974dabe5a"),
      ObjectId("60265a92f9bf1e3974dabe5b"),
      ObjectId("60267d14bc2f40d0dec1de3c")
    ]
  },
  {
    "_id": ObjectId("60265a92f9bf1e3974dabe65"),
    "Name": "Device3",
    "Configuration_ids": [
      ObjectId("60265a92f9bf1e3974dabe5e"),
      ObjectId("60265a92f9bf1e3974dabe5f"),
      ObjectId("60267d14bc2f40d0dec1de3d")
    ]
  }
]
If you were hoping to use update_many to achieve this in a single update, then the short answer is you can't. update_many takes a single filter to determine which documents to update; in your example, each update targets a different document id.
If you have a large number of these updates, and performance is an issue, consider using the bulk write operators.
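For instance, a minimal sketch with pymongo's bulk operators, reusing the device_ids and configuration_ids lists from the question:

from pymongo import UpdateOne

requests = [
    UpdateOne({'_id': dev_id}, {'$push': {'Configuration_ids': conf_id}})
    for dev_id, conf_id in zip(device_ids, configuration_ids)
]
device_collection.bulk_write(requests)  # one round trip instead of N update_one calls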
I am trying to get the values of the properties in JSON but I'm having a hard time fetching the ones inside an object array.
I have a function that loads a test JSON file:
import json
import pathlib

def get_test_body() -> str:
    directory = str(pathlib.Path(__file__).parent.parent.as_posix())
    f = open(directory + '/tests/json/test.json', "r")
    body = json.loads(f.read())
    f.close()
    return body
This is the first half of the JSON file (I've modified the names):
"id": "112358",
"name": "test",
"source_type": "SqlServer",
"connection_string_name": "123134-SQLTest-ConnectionString",
"omg_test": "12312435-123123-41232b5-asd123-1232145",
"triggers": [
{
"frequency": "Day",
"interval": 1,
"start_time": "2019-06-17T21:37:00",
"end_time": "2019-06-18T21:37:00",
"schedule": [
{
"hours": [
2
],
"minutes": [
0
],
"week_days": [],
"month_days": [],
"monthly_occurrences": []
}
]
}
]
The triggers array has more objects nested within it, and I couldn't figure out the syntax for them.
I am able to fetch some of the data using:
name = body['name']
But I couldn't fetch anything under the triggers array. I tried using body['triggers']['frequency'] and even ['triggers'][0] (lol) but I couldn't get it to work. I'm fairly new to Python, so any help would be appreciated!
I'm getting the right output, even with what you did:
import json

string = """
{
  "id": "112358",
  "name": "test",
  "source_type": "SqlServer",
  "connection_string_name": "123134-SQLTest-ConnectionString",
  "omg_test": "12312435-123123-41232b5-asd123-1232145",
  "triggers": [
    {
      "frequency": "Day",
      "interval": 1,
      "start_time": "2019-06-17T21:37:00",
      "end_time": "2019-06-18T21:37:00",
      "schedule": [
        {
          "hours": [
            2
          ],
          "minutes": [
            0
          ],
          "week_days": [],
          "month_days": [],
          "monthly_occurrences": []
        }
      ]
    }
  ]
}
"""

str_dict = json.loads(string)
print(str_dict["triggers"][0]["frequency"])
Giving me Day
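The same index-then-key chaining reaches the deeper levels too, e.g. the first scheduled hour:

print(str_dict["triggers"][0]["schedule"][0]["hours"][0])  # 2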
Here is sample data from a csv file, where every generation is a child of the previous generation.
parant,gen1,gen2,get3,gen4,gen5,gen6
query1,AggregateExpression,abc,def,emg,cdf,bcf
query1,And,cse,rds,acd,,
query2,Arithmetic,cbd,rsd,msd,,
query2,Average,as,vs,ve,ew,
query2,BinaryExpression,avsd,sfds,sdf,,
query2,Comparison,sdfs,sdfsx,,,
query3,Count,sfsd,,,,
query3,methods1,add,asd,fdds,sdf,sdf
query3,methods1,average,sdfs,bf,fd,
query4,methods2,distinct,cz,asd,ada,
query4,methods2,eq,sdfs,sdfxcv,sdf,rtyr
query4,methods3,eq,vcx,xcv,cdf,
I need to create a json file in the following format, where parents are the index, children are always a list of dictionaries, and the last generation carries a size, computed as the number of times its parent appears (in the previous generation).
Example of the first row's breakdown:
{
  "name": "query1",
  "children": [
    {
      "name": "AggregateExpression",
      "children": [
        {
          "name": "abc",
          "children": [
            {
              "name": "def",
              "children": [
                {
                  "name": "emg",
                  "children": [
                    {
                      "name": "cdf",
                      "children": [
                        {
                          "name": "bcf", "size": 1
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
I have tried to use groupby() and to_json() but was not able to complete it, and I am still struggling with the logic, whether I need a lambda or plain looping. Any suggestion or solution is welcome. Thanks.
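As a possible starting point, here is a minimal sketch that skips groupby()/to_json() and builds the nesting with plain loops. It assumes the csv above is saved as data.csv (a hypothetical name) and, for simplicity, fixes every leaf's size at 1 rather than implementing the exact parent-occurrence count:

import csv
import json
from collections import defaultdict

def tree():
    # a trie node: maps a child's name to its own trie node
    return defaultdict(tree)

root = tree()
with open('data.csv', newline='') as f:   # hypothetical filename
    reader = csv.reader(f)
    next(reader)                          # skip the header row
    for row in reader:
        node = root
        for cell in row:
            if cell:                      # ignore empty trailing generations
                node = node[cell]

def to_nested(name, node):
    if not node:                          # leaf of this branch
        return {"name": name, "size": 1}  # simplification: size fixed at 1
    return {"name": name,
            "children": [to_nested(k, v) for k, v in node.items()]}

result = [to_nested(k, v) for k, v in root.items()]
print(json.dumps(result[0], indent=2))    # the full tree rooted at query1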