Fastest way to find multiple entries from JSON using Python

I have a JSON file with around 50k items, where each item has an id and a name, as follows (I cut the data down):
[
    {
        "id": 2,
        "name": "Cannonball"
    },
    {
        "id": 6,
        "name": "Cannon base"
    },
    {
        "id": 8,
        "name": "Cannon stand"
    },
    {
        "id": 10,
        "name": "Cannon barrels"
    },
    {
        "id": 12,
        "name": "Cannon furnace"
    },
    {
        "id": 28,
        "name": "Insect repellent"
    },
    {
        "id": 30,
        "name": "Bucket of wax"
    }
]
Now I have an array of item names, and I want to find the corresponding id of each and add it to an id array.
For example, given itemName = ['Cannonball', 'Cannon furnace', 'Bucket of wax']
I would like to search the JSON and get back id_array = [2, 12, 30]
I wrote the following code, which does the job, but it seems like a huge waste of energy:
import json

file_name = "database.json"
with open(file_name, 'r') as f:
    document = json.loads(f.read())

id_array = []
items = ['Cannonball', 'Cannon furnace', 'Bucket of wax']
for item_name in items:
    for entry in document:
        if item_name == entry['name']:
            id_array.append(entry['id'])
Is there a faster method that can do this?
The example above shows only 3 names, but in practice I'm looking up a few thousand, and it feels like a waste to iterate over the whole document for every one of them.
Thank you

Build a lookup dictionary mapping names to ids, then look the names up in that dictionary:
lookup = {d["name"]: d["id"] for d in document}
items = ['Cannonball', 'Cannon furnace', 'Bucket of wax']
result = [lookup[item] for item in items]
print(result)
Output
[2, 12, 30]
The time complexity of this approach is O(n + m), where n is the number of elements in the document (len(document)) and m is the number of items (len(items)); in contrast, your approach is O(nm).
An alternative approach, which uses less space, is to keep only the names that appear in items when building the lookup:
items = ['Cannonball', 'Cannon furnace', 'Bucket of wax']
item_set = set(items)
lookup = {d["name"]: d["id"] for d in document if d["name"] in item_set}
result = [lookup[item] for item in items]
This approach has the same time complexity as the previous one.
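If some of the requested names might not exist in the document at all (the question doesn't say, so this is only a precaution), dict.get avoids a KeyError; a minimal sketch:
lookup = {d["name"]: d["id"] for d in document}
# None marks names that were not found; drop or report them as needed
result = [lookup.get(item) for item in items]
missing = [item for item in items if item not in lookup]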

You could generate a dict which maps name to id first:
import json

file_name = "database.json"
with open(file_name, 'r') as f:
    document = json.loads(f.read())

name_to_id = {item["name"]: item["id"] for item in document}
Now you can just iterate over items:
items = ['Cannonball', 'Cannon furnace', 'Bucket of wax']
id_array = [name_to_id[name] for name in items]
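If the same name can appear under several ids in the 50k items (not stated in the question, just a possible wrinkle), a name-to-list-of-ids mapping is a small variation on the same idea; a sketch:
from collections import defaultdict

name_to_ids = defaultdict(list)
for item in document:
    name_to_ids[item["name"]].append(item["id"])

# collect every id whose name matches one of the requested items
id_array = [i for name in items for i in name_to_ids[name]]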

Related

Delete from JSON nested dictionary in list value by iteration

I have a JSON file 'json_HW.json' with the following format:
{
    "news": [
        {
            "content": "Prices on gasoline have soared on 40%",
            "city": "Minsk",
            "news_date_and_time": "21/03/2022"
        },
        {
            "content": "European shares fall on weak earnings",
            "city": "Minsk",
            "news_date_and_time": "19/03/2022"
        }
    ],
    "ad": [
        {
            "content": "Rent a flat in the center of Brest for a month",
            "city": "Brest",
            "days": 15,
            "ad_start_date": "15/03/2022"
        },
        {
            "content": "Sell a bookshelf",
            "city": "Mogilev",
            "days": 7,
            "ad_start_date": "20/03/2022"
        }
    ],
    "coupon": [
        {
            "content": "BIG sales up to 50%!",
            "city": "Grodno",
            "days": 5,
            "shop": "Marko",
            "coupon_start_date": "17/03/2022"
        }
    ]
}
I need to delete each field name and field value as I reach them, until all of the information in the file has been deleted. When there is no information left in the file, I need to delete the file itself.
The code I have:
import json

data = json.load(open('json_HW.json'))
for category, posts in data.items():
    for post in posts:
        for field_name, field_value in post.items():
            del field_name, field_value
print(data)
But the variable data doesn't change when I run the deletes, so the del doesn't work. If it did, I could write the result back to my JSON file.
You are deleting the local variables holding the key and the value after extracting them from the dictionary; that doesn't affect the dictionary itself. What you should do is delete the dictionary entry:
import json
import os

file_name = 'json_HW.json'
data = json.load(open(file_name))

for category in list(data.keys()):
    posts = data[category]
    elem_indices = []
    for idx, post in enumerate(posts):
        for field_name in list(post.keys()):
            del post[field_name]
        if not post:
            elem_indices.insert(0, idx)  # so you get reverse order
    for idx in elem_indices:
        del posts[idx]
    if not posts:
        del data[category]

print(data)
if not data:
    print('deleting', file_name)
    os.unlink(file_name)
which gives:
{}
deleting json_HW.json
Note that the list() is necessary: post.keys() is a view of the dictionary's keys, and you cannot change the dict while you are iterating over its keys (or items, or values).
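As a quick standalone illustration of why the list() copy matters (a minimal sketch, not part of the original answer):
post = {"content": "Sell a bookshelf", "city": "Mogilev"}

# Deleting while iterating over the live view raises
# RuntimeError: dictionary changed size during iteration
# for key in post.keys():
#     del post[key]

# Iterating over a snapshot of the keys works fine
for key in list(post.keys()):
    del post[key]
print(post)  # {}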
If you want to delete a key-value pair from a dictionary, you can use del post[key]. It just can't be done while iterating over that dictionary, because the dictionary size keeps changing.
https://www.geeksforgeeks.org/python-ways-to-remove-a-key-from-dictionary/

Find average in Python (JSON)

I have a JSON like below:
[
    [
        {
            "subject": "Subject_1",
            "qapali_correct_count": "12"
        },
        {
            "subject": "Subject_2",
            "qapali_correct_count": "9"
        }
    ],
    [
        {
            "subject": "Subject_1",
            "qapali_correct_count": "14"
        },
        {
            "subject": "Subject_2",
            "qapali_correct_count": "15"
        }
    ],
    [
        {
            "subject": "Subject_1",
            "qapali_correct_count": "11"
        },
        {
            "subject": "Subject_2",
            "qapali_correct_count": "12"
        }
    ]
]
I have to output every subject's average, for example: subject_1 = 12.33, subject_2 = 12.
I tried the code below and it works, but I wonder whether there is any way to speed it up. Are there other, more efficient ways to achieve this?
results = Result.objects.filter(exam=obj_exam, grade=obj_grade)
student_count = results.count()
final_data = {}
for result in results:
    st_naswer_js = json.loads(result.student_answer_data_finish)
    for rslt in st_naswer_js:
        previus_data = final_data.get(rslt['subject'], 0)
        previus_data = previus_data + int(rslt['qapali_correct_count'])
        final_data.update({rslt['subject']: previus_data})
for dudu, data in final_data.items():
    tmp_data = data / student_count
    final_data[dudu] = tmp_data
print(final_data)
Please note that it is a Django project.
The code in your question has several non-relevant bits. I'll stick to this part:
I have to output every subject's average: for example: subject_1 = 12.33, subject_2=12
I'll assume the list of results above is in a list called results. If it's json-loaded per student, handling that is probably already in your existing code. Primary focus below is on subject_score.
Store the score of each subject in a dictionary whose values are lists of scores. I'm using a defaultdict here, with list as the default factory, so when a dictionary value which doesn't exist is accessed it gets initialised to an empty list (rather than raising a KeyError, which is what a standard dictionary would do).
import collections

subject_score = collections.defaultdict(list)
for result in results:
    for stud_score in result:
        # add each score to the list of scores for that subject
        # (use int or float as needed)
        subject_score[stud_score['subject']].append(int(stud_score['qapali_correct_count']))

# `subject_score` is now:
# defaultdict(list, {'Subject_1': [12, 14, 11], 'Subject_2': [9, 15, 12]})

averages = {sub: sum(scores) / len(scores) for sub, scores in subject_score.items()}
averages is:
{'Subject_1': 12.333333333333334, 'Subject_2': 12.0}
Or you can print or save to file, db, etc. as needed.
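If you'd rather not keep every individual score in memory (hardly an issue at this size, but worth noting), a running sum and count per subject gives the same averages in one pass; a sketch along the same lines as the answer above:
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])  # subject -> [running sum, count]
for result in results:
    for stud_score in result:
        entry = totals[stud_score['subject']]
        entry[0] += int(stud_score['qapali_correct_count'])
        entry[1] += 1

averages = {sub: s / c for sub, (s, c) in totals.items()}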

Python: transform data from csv to array of dictionaries and group by field value

I have a csv like this:
id,company_name,country,country_id
1,batstop,usa, xx
2,biorice,italy, yy
1,batstop,italy, yy
3,legstart,canada, zz
I want an array of dictionaries to import into Firebase. I need to group the different country information for the same company into a nested list of dictionaries. This is the desired output:
[ {'id': '1', 'agency_name': 'batstop', 'countries': [{'country': 'usa', 'country_id': 'xx'}, {'country': 'italy', 'country_id': 'yy'}]},
  {'id': '2', 'agency_name': 'biorice', 'countries': [{'country': 'italy', 'country_id': 'yy'}]},
  {'id': '3', 'agency_name': 'legstart', 'countries': [{'country': 'canada', 'country_id': 'zz'}]} ]
Recently I had a similar task; the groupby function from itertools and the itemgetter function from operator, both in the Python standard library, helped me a lot. Here's the code for your csv; note how important it is to define the primary keys of your csv dataset.
import csv
import json
from operator import itemgetter
from itertools import groupby

primary_keys = ['id', 'company_name']

# Start extraction
with open('input.csv', 'r') as file:
    # Read data from csv
    reader = csv.DictReader(file)
    # Sort data according to the primary keys
    reader = sorted(reader, key=itemgetter(*primary_keys))

# Create a list of tuples,
# each containing a dict of the group's primary keys and their values, and a list of the group's ordered dicts
groups = [(dict(zip(primary_keys, _[0])), list(_[1])) for _ in groupby(reader, key=itemgetter(*primary_keys))]

# Create formatted dicts to be converted into firebase objects
group_dicts = []
for group in groups:
    group_dict = {
        "id": group[0]['id'],
        "agency_name": group[0]['company_name'],
        "countries": [
            dict(country=_['country'], country_id=_['country_id']) for _ in group[1]
        ],
    }
    group_dicts.append(group_dict)

print("\n".join([json.dumps(_, indent=2) for _ in group_dicts]))
Here's the output:
{
  "id": "1",
  "agency_name": "batstop",
  "countries": [
    {
      "country": "usa",
      "country_id": " xx"
    },
    {
      "country": "italy",
      "country_id": " yy"
    }
  ]
}
{
  "id": "2",
  "agency_name": "biorice",
  "countries": [
    {
      "country": "italy",
      "country_id": " yy"
    }
  ]
}
{
  "id": "3",
  "agency_name": "legstart",
  "countries": [
    {
      "country": "canada",
      "country_id": " zz"
    }
  ]
}
No external library is needed.
Hope it suits you well!
You can try this; you may have to change a few parts to get it working with your csv, but I hope it's enough to get you started:
csv = [
    "1,batstop,usa, xx",
    "2,biorice,italy, yy",
    "1,batstop,italy, yy",
    "3,legstart,canada, zz"
]

output = {}  # dictionary useful to avoid searching the list for existing ids

# Parse each row
for line in csv:
    cols = line.split(',')
    id = int(cols[0])
    agency_name = cols[1]
    country = cols[2]
    country_id = cols[3]
    if id in output:
        # append a dict (not a nested list) so every entry of 'countries' has the same shape
        output[id]['countries'].append({'country': country,
                                        'country_id': country_id})
    else:
        output[id] = {'id': id,
                      'agency_name': agency_name,
                      'countries': [{'country': country,
                                     'country_id': country_id}]}

# Put into a list
json_output = []
for key in output.keys():
    json_output.append(output[key])

# Check output
for row in json_output:
    print(row)
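Another possible sketch (not from the original answers): if the data sits in an actual csv file with the header row shown in the question, csv.DictReader plus a plain dict keyed by id avoids both the manual split and the sorting step. The file name input.csv is assumed, and the stray spaces visible in the sample are stripped:
import csv

grouped = {}
with open('input.csv', newline='') as f:
    for row in csv.DictReader(f):
        # create the company entry on first sight of this id, then reuse it
        entry = grouped.setdefault(row['id'], {
            'id': row['id'],
            'agency_name': row['company_name'],
            'countries': []
        })
        entry['countries'].append({
            'country': row['country'].strip(),
            'country_id': row['country_id'].strip()
        })

result = list(grouped.values())
print(result)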

Robust way to sum all values corresponding to a particular object's property?

I have an array like the following:
items = [
    {
        "title": "title1",
        "category": "category1",
        "value": 200
    },
    {
        "title": "title2",
        "category": "category2",
        "value": 450
    },
    {
        "title": "title3",
        "category": "category1",
        "value": 100
    }
]
This array consists of many dictionaries, each with a category and a value property.
What is a robust way of getting an array of category objects with their values summed, like this:
data = [
    {
        "category": "category1",
        "value": 300
    },
    {
        "category": "category2",
        "value": 450
    }
]
I'm looking for the best possible algorithm or approach, for small arrays as well as huge ones. If there is an existing algorithm, please point me to the source.
What I tried:
data = []
for item in items:
    for entry in data:
        if entry["category"] == item["category"]:
            entry["value"] += item["value"]
            break
    else:
        data.append({"category": item["category"], "value": item["value"]})
Note: Any programming language is welcome. Please comment before downvoting.
In JavaScript, you can use reduce to group the array into an object, using the category as the property, and then Object.values to convert that object back into an array.
var items = [{
    "title": "title1",
    "category": "category1",
    "value": 200
  },
  {
    "title": "title2",
    "category": "category2",
    "value": 450
  },
  {
    "title": "title3",
    "category": "category1",
    "value": 100
  }
];

var data = Object.values(items.reduce((c, v) => {
  c[v.category] = c[v.category] || {category: v.category, value: 0};
  c[v.category].value += v.value;
  return c;
}, {}));

console.log(data);
What you need is a SQL-style group by operation. Usually such group by operations are handled with hashing. If all your data fits in memory (small to large data structures), you can implement it very easily.
If your data is huge, you will need intermediate storage (such as the hard drive or a database).
data_tmp = {}
for item in items:
    if item['category'] not in data_tmp:
        data_tmp[item['category']] = 0
    data_tmp[item['category']] += item['value']

data = []
for k, v in data_tmp.items():
    data.append({
        'category': k,
        'value': v
    })
# done
If you want more Pythonic code, you can use a defaultdict:
from collections import defaultdict

data_tmp = defaultdict(int)
for item in items:
    data_tmp[item['category']] += item['value']

data = []
for k, v in data_tmp.items():
    data.append({
        'category': k,
        'value': v
    })
# done
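The final loop that rebuilds the list can equally be written as a single comprehension, if you find that clearer:
data = [{'category': k, 'value': v} for k, v in data_tmp.items()]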
In Python, Pandas is likely to be a more convenient and efficient way of doing this.
import pandas as pd

df = pd.DataFrame(items)
# group on "category" and sum only the numeric "value" column
# (summing the whole frame would also try to aggregate the string "title" column)
sums = df.groupby("category", as_index=False)["value"].sum()
data = sums.to_dict("records")
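For the sample items above this should print something like:
print(data)
# [{'category': 'category1', 'value': 300}, {'category': 'category2', 'value': 450}]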
For the final step, it may be more convenient to leave sums as a dataframe and work with it like that instead of converting back to a list of dictionaries.
Using itertools.groupby (note that the input must be sorted by the grouping key first):
from itertools import groupby

d = []
lista = sorted(items, key=lambda x: x['category'])
for k, g in groupby(lista, key=lambda x: x['category']):
    temp = {}
    temp['category'] = k
    temp['value'] = sum(i['value'] for i in g)
    d.append(temp)
print(d)
# [{'category': 'category1', 'value': 300}, {'category': 'category2', 'value': 450}]

Index JSON searches in Python

I have the following JSON that I get from a URL:
[{
    "id": 1,
    "version": 23,
    "external_id": "2312",
    "url": "https://example.com/432",
    "type": "typeA",
    "date": "2",
    "notes": "notes",
    "title": "title",
    "abstract": "dsadasdas",
    "details": "something",
    "accuracy": 0,
    "reliability": 0,
    "severity": 12,
    "thing": "32132",
    "other": [
        "aaaaaaaaaaaaaaaaaa",
        "bbbbbbbbbbbbbb",
        "cccccccccccccccc",
        "dddddddddddddd",
        "eeeeeeeeee"
    ],
    "nana": 8
},
{
    "id": 2,
    "version": 23,
    "external_id": "2312",
    "url": "https://example.com/432",
    "type": "typeA",
    "date": "2",
    "notes": "notes",
    "title": "title",
    "abstract": "dsadasdas",
    "details": "something",
    "accuracy": 0,
    "reliability": 0,
    "severity": 12,
    "thing": "32132",
    "other": [
        "aaaaaaaaaaaaaaaaaa",
        "bbbbbbbbbbbbbb",
        "cccccccccccccccc",
        "dddddddddddddd",
        "eeeeeeeeee"
    ],
    "nana": 8
}]
My code:
import json
import urllib2
data = json.load(urllib2.urlopen('http://someurl/path/to/json'))
print data
I want to know how to access the "abstract" field of the object that has "id" equal to 2, for example. The "id" field is unique, so I can use it to index my searches.
Thanks!
Here's one way to do it. You can create a generator via a generator expression, call next to iterate that generator once, and get back the desired object.
item = next((item for item in data if item['id'] == 2), None)
if item:
    print item['abstract']
See also Python: get a dict from a list based on something inside the dict
EDIT: If you'd like access to all elements of the list that have a given key value (for example, id == 2), you can do one of two things: either create a list via a comprehension (as shown in the other answer), or alter my solution:
my_gen = (item for item in data if item['id'] == 2)
for item in my_gen:
    print item
In the loop, item will iterate over those items in your list which satisfy the given condition (here, id == 2).
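Since the question says that id is unique, another option (a sketch, not part of the original answers) is to build an id-to-object dictionary once, so that repeated lookups become constant time:
by_id = {item['id']: item for item in data}

print(by_id[2]['abstract'])  # direct lookup by unique id
entry = by_id.get(99)        # None if an id (99 is just an example) doesn't exist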
You can use a list comprehension to filter:
import json
j = """[{"id":1,"version":23,"external_id":"2312","url":"https://example.com/432","type":"typeA","date":"2","notes":"notes","title":"title","abstract":"dsadasdas","details":"something","accuracy":0,"reliability":0,"severity":12,"thing":"32132","other":["aaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb","cccccccccccccccc","dddddddddddddd","eeeeeeeeee"],"nana":8},{"id":2,"version":23,"external_id":"2312","url":"https://example.com/432","type":"typeA","date":"2","notes":"notes","title":"title","abstract":"dsadasdas","details":"something","accuracy":0,"reliability":0,"severity":12,"thing":"32132","other":["aaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb","cccccccccccccccc","dddddddddddddd","eeeeeeeeee"],"nana":8}]"""
dicto = json.loads(j)
results = [x for x in dicto if "id" in x and x["id"]==2]
And then you can print the 'abstract' values like so:
for result in results:
    if "abstract" in result:
        print result["abstract"]
import urllib2
import json

data = json.load(urllib2.urlopen('http://someurl/path/to/json'))
your_id = raw_input('enter the id')
for each in data:
    if each['id'] == int(your_id):  # raw_input returns a string, so convert before comparing
        print each['abstract']
In the above code, data is a list and each is a dict, so you can easily access each dict's fields.
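Note that urllib2 only exists on Python 2; on Python 3 the equivalent lives in urllib.request, so the same idea would look roughly like this:
import json
from urllib.request import urlopen

data = json.load(urlopen('http://someurl/path/to/json'))
your_id = int(input('enter the id: '))
for each in data:
    if each['id'] == your_id:
        print(each['abstract'])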
