Extracting a key from a multilevel (scraped) complex-structure JSON file in Python

I have a multilevel, complex JSON file, twitter.json, and I want to extract ONLY the author IDs from it.
This is how my file 'twitter.json' looks:
[
[
{
"tweets_results": [
{
"meta": {
"result_count": 0
}
}
],
"youtube_link": "www.youtube.com/channel/UCl4GlGXR0ED6AUJU1kRhRzQ"
}
],
[
{
"tweets_results": [
{
"data": [
{
"author_id": "125959599",
"created_at": "2021-06-12T15:16:40.000Z",
"id": "1403732993269649410",
"in_reply_to_user_id": "125959599",
"lang": "pt",
"public_metrics": {
"like_count": 0,
"quote_count": 0,
"reply_count": 1,
"retweet_count": 0
},
"source": "Twitter for Android",
"text": "⌨️ Canais do YouTube:\n\n1 - Alexandre Garcia: Canal de Brasília"
},
{
"author_id": "521827796",
"created_at": "2021-06-07T20:23:08.000Z",
"id": "1401998177943834626",
"in_reply_to_user_id": "623794755",
"lang": "und",
"public_metrics": {
"like_count": 0,
"quote_count": 0,
"reply_count": 0,
"retweet_count": 0
},
"source": "TweetDeck",
"text": "#thelittlecouto"
}
],
"meta": {
"newest_id": "1426546114115870722",
"oldest_id": "1367808835403063298",
"result_count": 7
}
}
],
"youtube_link": "www.youtube.com/channel/UCm0yTweyAa0PwEIp0l3N_gA"
}
]
]
I have read through many similar SO questions (including but not limited to):
Access the key of a multilevel JSON file in python
Multilevel JSON Dictionary - can't extract key into new dictionary
How to extract a single value from JSON response?
how to get fields and values from a specific key of json file in python
How to select specific key/value of an object in json via python
Python: Getting all values of a specific key from json
But the structures of those JSON files are pretty simple, and when I try to replicate those answers, I hit errors.
From what I read, contents.tweets_results.data.author_id is how the reference would go, and I am loading the file with contents = json.load(open("twitter.json")). Any help is appreciated.
EDIT: Both @sammywemmy's and @balderman's code worked for me. I accepted @sammywemmy's because I used that code, but I wanted to credit them both in some way.

Your data has a path to it. You have a list nested in a list; each dict in the inner list has a tweets_results key whose value is a list of dicts; one of those dicts has a data key, which holds a list of dictionaries, and each of those has an author_id key. We can write the path (sort of) as '[][].tweets_results[].data[].author_id'.
To restate: enter the outer list, then the inner list, then access the tweets_results key and iterate over its list; within each item, access the data key, and within the list stored under data, access author_id.
With this path, one can use jmespath to pull out the author_ids:
# pip install jmespath
import jmespath
# compile once and reuse, similar to re.compile
expression = jmespath.compile('[][].tweets_results[].data[].author_id')
# data is the parsed JSON, i.e. what json.load() returned for twitter.json
expression.search(data)
['125959599', '521827796']
jmespath is quite useful if you want to build a data structure from nested dicts; if, however, you only need the values for author_id, you can use nested_lookup instead. It searches recursively for the key and returns the values:
# pip install nested-lookup
from nested_lookup import nested_lookup
nested_lookup('author_id', data)
['125959599', '521827796']
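If installing a package is not an option, the recursive search that nested_lookup performs can be approximated with the standard library alone. This is a minimal sketch, assuming the parsed JSON contains only dicts, lists, and scalars; find_key is an illustrative helper name, not part of any library:

```python
def find_key(obj, key):
    """Yield every value stored under `key`, at any depth of nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # keep descending: the value itself may contain more matches
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


# trimmed-down stand-in for the structure in the question
sample = [
    [{"tweets_results": [{"meta": {"result_count": 0}}]}],
    [{"tweets_results": [{"data": [{"author_id": "125959599"},
                                   {"author_id": "521827796"}],
                          "meta": {"result_count": 7}}]}],
]

author_ids = list(find_key(sample, "author_id"))
print(author_ids)
```

Because the match is on the exact key name, keys such as in_reply_to_user_id are not picked up.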

See below (no external lib is involved)
data = [
[
{
"tweets_results": [
{
"meta": {
"result_count": 0
}
}
],
"youtube_link": "www.youtube.com/channel/UCl4GlGXR0ED6AUJU1kRhRzQ"
}
],
[
{
"tweets_results": [
{
"data": [
{
"author_id": "125959599",
"created_at": "2021-06-12T15:16:40.000Z",
"id": "1403732993269649410",
"in_reply_to_user_id": "125959599",
"lang": "pt",
"public_metrics": {
"like_count": 0,
"quote_count": 0,
"reply_count": 1,
"retweet_count": 0
},
"source": "Twitter for Android",
"text": "⌨️ Canais do YouTube:\n\n1 - Alexandre Garcia: Canal de Brasília"
},
{
"author_id": "521827796",
"created_at": "2021-06-07T20:23:08.000Z",
"id": "1401998177943834626",
"in_reply_to_user_id": "623794755",
"lang": "und",
"public_metrics": {
"like_count": 0,
"quote_count": 0,
"reply_count": 0,
"retweet_count": 0
},
"source": "TweetDeck",
"text": "#thelittlecouto"
}
],
"meta": {
"newest_id": "1426546114115870722",
"oldest_id": "1367808835403063298",
"result_count": 7
}
}
],
"youtube_link": "www.youtube.com/channel/UCm0yTweyAa0PwEIp0l3N_gA"
}
]
]
ids = []
for entry in data:
    for sub in entry:
        result = sub['tweets_results']
        if result[0].get('data'):
            info = result[0]['data']
            for item in info:
                ids.append(item.get('author_id', 'not_found'))
print(ids)
Output:
['125959599', '521827796']
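The loop above inspects only result[0]. If a tweets_results list could hold more than one result object (an assumption; the sample shows one per list), a comprehension over every level covers that case too. A sketch using a trimmed-down copy of the data:

```python
data = [
    [{"tweets_results": [{"meta": {"result_count": 0}}],
      "youtube_link": "www.youtube.com/channel/UCl4GlGXR0ED6AUJU1kRhRzQ"}],
    [{"tweets_results": [{"data": [{"author_id": "125959599"},
                                   {"author_id": "521827796"}],
                          "meta": {"result_count": 7}}],
      "youtube_link": "www.youtube.com/channel/UCm0yTweyAa0PwEIp0l3N_gA"}],
]

ids = [
    tweet["author_id"]
    for entry in data                      # outer list
    for sub in entry                       # inner list of dicts
    for result in sub.get("tweets_results", [])
    for tweet in result.get("data", [])    # "data" is absent when result_count is 0
    if "author_id" in tweet
]
print(ids)
```

Unlike the loop version, tweets without an author_id are skipped here rather than recorded as 'not_found'.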

Related

Django case insensitive search in multilevel jsonb field using ORM methods

here is my sample jsonb field:
{
"name": "XXXXX",
"duedate": "Wed Aug 31 2022 17:23:13 GMT+0530",
"structure": {
"sections": [
{
"id": "0",
"temp_id": 9,
"expanded": true,
"requests": [
{
"title": "entity onboarding form", # I need to lookup at this level (or key)
"agents": [
{
"email": "ak#xxx.com",
"user_name": "Akhild",
"review_status": 0
}
],
"req_id": "XXXXXXXX",
"status": "Requested"
},
{
"title": "onboarding", # I need to lookup at this level (or key)
"agents": [
{
"email": "adak#xxx.com",
"user_name": "gaajj",
"review_status": 0
}
],
"req_id": "XXXXXXXX",
"status": "Requested"
}
],
"labelText": "Pan Card"
}
]
},
"Agentnames": "",
"clientowners": "Admin",
"collectionname": "Bank_duplicate"
}
In this JSON I need a case-insensitive match on structure -> sections -> requests (array) -> title, inside each object of the requests array.
I have tried this query filter:
(Q(requests__structure__sections__contains=[{'requests':[{"title": query}]}]))
but it is case sensitive. I have also tried
self.get_queryset().annotate(search=SearchVector(Cast('requests__structure__sections', TextField())))
which does give case-insensitive results, but it also matches against keys other than title.
I also tried raw SQL, but could not get beyond the requests array.
I am looking for any other method or approach in the Django ORM that achieves the required result.
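To pin down what the filter should match, here is the intended predicate in plain Python, outside the ORM. This is only a sketch of the matching rule, not a database query; has_title is a hypothetical helper, and exact case-folded equality is an assumption (the question does not say whether a substring match is wanted):

```python
def has_title(record, query):
    """True if any request title in any section equals `query`, ignoring case."""
    wanted = query.casefold()
    sections = record.get("structure", {}).get("sections", [])
    return any(
        request.get("title", "").casefold() == wanted
        for section in sections
        for request in section.get("requests", [])
    )


# cut-down copy of the sample jsonb value
record = {
    "name": "XXXXX",
    "structure": {
        "sections": [
            {
                "id": "0",
                "requests": [
                    {"title": "entity onboarding form", "status": "Requested"},
                    {"title": "onboarding", "status": "Requested"},
                ],
                "labelText": "Pan Card",
            }
        ]
    },
}

print(has_title(record, "ONBOARDING"))
print(has_title(record, "Pan Card"))
```

Only title values are compared, so a query that happens to equal labelText or status does not match.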

Python - accessing a value within in a nested dictionary within a list

I have a JSON API response that looks like the following:
json_data=
{
"sales_list": [
{
"date": "all",
"country": "all",
"units": {
"product": {
"promotions": 0,
"downloads": 1,
"updates": 2,
"refunds": 3
},
"iap": {
"promotions": 0,
"sales": 0,
"refunds": 0
}
},
"revenue": {
"product": {
"promotions": "0.00",
"downloads": "0.00",
"updates": "0.00",
"refunds": "0.00"
},
"iap": {
"promotions": "0.00",
"sales": "0.00",
"refunds": "0.00"
},
"ad": "0.00"
}
}
],
"next_page": null,
"code": 200,
"prev_page": null,
"vertical": "apps",
"page_num": 1,
"iap_sales_list": [],
"currency": "USD",
"page_index": 0,
"market": "ios"
}
I'm using Python and am trying to access the first "downloads" value in the response. So I need to go from sales_list (list in a dict) > units (dict) > product (dict) > downloads. How do I go about digging down through these multiple layers to access just this single value?
I've seen questions about accessing values within a dictionary within a list, or within a nested dictionary. But I'm a little confused as to how to navigate between/among lists in dictionaries and dictionaries in lists. Any help would be greatly appreciated.
Similar question: Python Accessing Nested JSON Data
Is this what you need?
print(json_data['sales_list'][0]['units']['product']['downloads'])
It gives the output 1.
To answer your question:
As you can see, your JSON field sales_list is a one-element list of dictionaries:
[ {dictionary with field you need}, {other dict}, .. ]
Because of that, you need to specify the index of the list element you want to access. For your one-element list it is [0], because the first element of the list contains the field you need:
>>> json_data['sales_list'][0]['units']['product']['downloads']
1
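When the same kind of lookup is needed for several paths, the chain of subscripts can be wrapped in a small helper that walks a list of keys and indexes. A sketch, where dig is a hypothetical helper name and returning a default on a missing step is a design choice, not something the answer above prescribes:

```python
def dig(obj, path, default=None):
    """Follow `path` (dict keys and list indexes) into nested data."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            # missing key, index out of range, or a scalar where a container was expected
            return default
    return obj


# cut-down copy of the API response
json_data = {
    "sales_list": [
        {"units": {"product": {"promotions": 0, "downloads": 1,
                               "updates": 2, "refunds": 3}}}
    ],
    "iap_sales_list": [],
}

downloads = dig(json_data, ["sales_list", 0, "units", "product", "downloads"])
missing = dig(json_data, ["iap_sales_list", 0, "units"], default="n/a")
print(downloads, missing)
```

String steps index dictionaries and integer steps index lists, which mirrors the alternation of ['key'] and [0] in the one-line version.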

Insert into a list only some keys of a Python nested dictionary

I'm using Spotipy for getting all the albums from an artist.
I have the following Python dictionary object for each query (one per artist queried):
{
"href": "https://api.spotify.com/v1/artists/006ibfxHXj6ewIkihKcaS2/albums?offset=0&limit=1&include_groups=album",
"items": [
{
"album_group": "album",
"album_type": "album",
"artists": [
{
"external_urls": {
"spotify": "https://open.spotify.com/artist/006ibfxHXj6ewIkihKcaS2"
},
"href": "https://api.spotify.com/v1/artists/006ibfxHXj6ewIkihKcaS2",
"id": "006ibfxHXj6ewIkihKcaS2",
"name": "Hello Meteor",
"type": "artist",
"uri": "spotify:artist:006ibfxHXj6ewIkihKcaS2"
}
],
"available_markets": [
"blabla"
],
"external_urls": {
"spotify": "https://open.spotify.com/album/19HZblBbWVWYVqiM0B9eW8"
},
"href": "https://api.spotify.com/v1/albums/19HZblBbWVWYVqiM0B9eW8",
"id": "19HZblBbWVWYVqiM0B9eW8",
"images": [
{
"height": 640,
"url": "https://i.scdn.co/image/8c249db0add94460c7e61e994e7ac3f8f1abddd9",
"width": 640
},
{
"height": 300,
"url": "https://i.scdn.co/image/03ff6bd7c00fd58b167a4f3bc5529e5d17bf7ee1",
"width": 300
},
{
"height": 64,
"url": "https://i.scdn.co/image/151539b29846c6ae9b68c628e639d66277349468",
"width": 64
}
],
"name": "Mu & Mea",
"release_date": "2018-07-17",
"release_date_precision": "day",
"total_tracks": 15,
"type": "album",
"uri": "spotify:album:19HZblBbWVWYVqiM0B9eW8"
}
],
"limit": 1,
"next": "https://api.spotify.com/v1/artists/006ibfxHXj6ewIkihKcaS2/albums?offset=1&limit=1&include_groups=album",
"offset": 0,
"previous": null,
"total": 6
}
I have the following line of code that adds all items object to the list:
albums.extend(sp.artist_albums(artist, album_type='album', limit=1)['items'] for artist in artists)
The problem is that I only need two of the many keys that this returns: the album title and the release date. The output I would like is a list:
[['album name 1', 'release_date1'], ['album name2', 'release_date2'], ...]
Rather than add the ['items'] list (which only contains a single album, if I understand your limit=1 query correctly), add a new dictionary with the specific values.
To avoid having to call the Spotify API twice for those two items, put your query loop into a generator expression; that makes it easier to then take the resulting album dictionary and take out specific keys:
results = (result for artist in artists
           for result in sp.artist_albums(artist, album_type='album', limit=1)['items'])
albums.extend([r['name'], r['release_date']] for r in results)
Here, results is a lazily evaluated sequence of {'album_group': ..., 'album_type': ..., ...} dictionaries; these are all the albums in the 'items' list for each artist queried. There is only 1 for each artist here, but on the off-chance there might be zero albums, or you wanted to raise the limit value, I make sure to loop over the items.
The generator expression in albums.extend() then creates a new list object holding the values of those two keys for each of the results.
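The same pattern can be tried without network access by swapping the Spotipy call for a stand-in. In this sketch fake_artist_albums, the artist names, and the canned responses are hypothetical; only the shape of the returned dictionary ('items' holding album dicts with 'name' and 'release_date') follows the response shown in the question:

```python
def fake_artist_albums(artist, album_type="album", limit=1):
    """Hypothetical stand-in for sp.artist_albums(); returns canned data."""
    canned = {
        "artist-a": [{"name": "Mu & Mea", "release_date": "2018-07-17",
                      "total_tracks": 15}],
        "artist-b": [],  # an artist with no albums
    }
    return {"items": canned[artist][:limit], "limit": limit}


artists = ["artist-a", "artist-b"]
albums = []

# flatten every artist's 'items' list into one lazy stream of album dicts
results = (result for artist in artists
           for result in fake_artist_albums(artist, album_type="album", limit=1)["items"])

# keep only the two wanted values from each album
albums.extend([r["name"], r["release_date"]] for r in results)
print(albums)
```

The artist with an empty 'items' list simply contributes nothing, which is the zero-album case the answer mentions.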

How to find data in JSON using Python and Watson Discovery News

{
"matching_results": 1264,
"results": [
{
"main_image_url": "https://s4.reutersmedia.net/resources_v2/images/rcom-default.png",
"enriched_text": {
"entities": [
{
"relevance": 0.33,
"disambiguation": {
"subtype": [
"Country"
]
},
"sentiment": {
"score": 0
},
"type": "Location",
"count": 1,
"text": "China"
},
{
"relevance": 0.33,
"disambiguation": {
"subtype": [
"Country"
]
},
"sentiment": {
"score": 0
},
The file is very large, so I want to extract just "relevance" and "score" using Python.
How can I fetch this information?
Regardless of how large it is, it is only a simple dictionary.
Iterate over the lists and extract the values:
for result in data['results']:
    for e in result['enriched_text']['entities']:
        print(e['relevance'])
        print(e['sentiment']['score'])
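If some entities lack a sentiment block (an assumption; the excerpt is truncated, so this cannot be checked), .get() avoids a KeyError. A sketch that collects the pairs instead of printing them, using a cut-down copy of the data in which the second entity deliberately has no sentiment:

```python
data = {
    "matching_results": 1264,
    "results": [
        {
            "enriched_text": {
                "entities": [
                    {"relevance": 0.33, "sentiment": {"score": 0},
                     "type": "Location", "text": "China"},
                    {"relevance": 0.33, "type": "Location"},  # no "sentiment" key
                ]
            }
        }
    ],
}

pairs = []
for result in data.get("results", []):
    entities = result.get("enriched_text", {}).get("entities", [])
    for e in entities:
        # a missing key yields None instead of raising
        pairs.append((e.get("relevance"), e.get("sentiment", {}).get("score")))

print(pairs)
```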

dictionary does not give me unique Ids in python

I have the output of an Elasticsearch query saved in a file. The first few lines look like this:
{"took": 1,
"timed_out": false,
"_shards": {},
"hits": {
"total": 27,
"max_score": 6.5157733,
"hits": [
{
"_index": "dbgap_062617",
"_type": "dataset",
***"_id": "595189d15152c64c3b0adf16"***,
"_score": 6.5157733,
"_source": {
"dataAcquisition": {
"performedBy": "\n\t\tT\n\t\t"
},
"provenance": {
"ingestTime": "201",
},
"studyGroup": [
{
"Identifier": "1",
"name": "Diseas"
}
],
"license": {
"downloadURL": "http",
},
"study": {
"alternateIdentifiers": "yes",
},
"disease": {
"name": [
"Coronary Artery Disease"
]
},
"NLP_Fields": {
"CellLine": [],
"MeshID": [
"C0066533",
],
"DiseaseID": [
"C0010068"
],
"ChemicalID": [],
"Disease": [
"coronary artery disease"
],
"Chemical": [],
"Meshterm": [
"migen",
]
},
"datasetDistributions": [
{
"dateReleased": "20150312",
}
],
"dataset": {
"citations": [
"20032323"
],
**"description": "The Precoc.",**
**"title": "MIGen_ExS: PROCARDIS"**
},
.... and the list goes on with a bunch of other items ....
From all of these nodes I was interested in Unique _Ids, title, and description. So, I created a dictionary and extracted the parts that I was interested in using json. Here is my code:
import json
s={}
d=open('local file','w')
with open('localfile', 'r') as ready:
    for line in ready:
        test=json.loads(line, encoding='utf-8')
        for i in (test['hits']['hits']):
            for x in i:
                s.setdefault(i['_id'], [i['_source']['dataset']
                ['description'], i['_source']['dataset']['title']])
                for k, v in s.items():
                    d.write(k +'\t'+v[0] +'\t' + v[1] + '\n')
d.close()
Now, when I run it, it gives me a file with duplicated _ids! Isn't a dictionary supposed to give me unique _ids? My original output file has lots of duplicated IDs that I wanted to get rid of.
Also, I ran set() on the _ids alone to count the unique ones, and it came to 138. But with the dictionary, if I remove the generated duplicate IDs, it comes down to 17!
Can someone please tell me why this is happening?
If you want a unique ID and you're using a database, it will create one for you. If you're not, you'll need to generate a unique number or string. Depending on how the dictionaries are created, you could use the timestamp of when the dictionary was created, or you could use uuid.uuid4(). For more info, see the uuid module documentation.
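On the duplication itself: a dictionary does keep each key only once, so repeated lines in the output usually come from writing the dictionary more than once. One possible cause, assuming the loop over s.items() sits inside the reading loops, is that every pass writes out everything collected so far. A sketch that collects first and writes once; the io.StringIO objects stand in for the two files, and the sample hits are made up for illustration:

```python
import io
import json

# two JSON documents, one per line; the second repeats an _id from the first
lines = io.StringIO(
    json.dumps({"hits": {"hits": [
        {"_id": "a1", "_source": {"dataset": {"description": "d1", "title": "t1"}}},
        {"_id": "b2", "_source": {"dataset": {"description": "d2", "title": "t2"}}},
    ]}}) + "\n" +
    json.dumps({"hits": {"hits": [
        {"_id": "a1", "_source": {"dataset": {"description": "d1", "title": "t1"}}},
    ]}}) + "\n"
)

seen = {}
for line in lines:
    doc = json.loads(line)
    for hit in doc["hits"]["hits"]:
        dataset = hit["_source"]["dataset"]
        # setdefault keeps the first entry for each _id and ignores repeats
        seen.setdefault(hit["_id"], (dataset["description"], dataset["title"]))

# write only after all input has been read, so each _id appears once
out = io.StringIO()
for _id, (description, title) in seen.items():
    out.write(f"{_id}\t{description}\t{title}\n")

written = out.getvalue()
print(written, end="")
```

With real files, the same structure applies: read and fill the dictionary inside the with block, then open the output file and write after the reading loop has finished.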
