Merge two lists of dictionaries based on key values - python

I have two lists of dictionaries:
a =[{ 'id': "1", 'date': "2017-01-24" },{ 'id': "2", 'date': "2018-01-24" },{ 'id': "3", 'date': "2019-01-24" }]
b =[{ 'id': "1", 'name': "abc" },{ 'id': "2",'name': "xyz"},{ 'id': "4",'name': "ijk"}]
I want to merge these dictionaries based on id and the result should be:
[{ 'id': "1", 'date': "2017-01-24",'name': "abc" },{ 'id': "2", 'date': "2018-01-24",'name': "xyz" },{ 'id': "3", 'date': "2019-01-24" },{ 'id': "4",'name': "ijk"}]
How can I do this without iterating in python?

Since the dicts are stored in lists, you'll either have to iterate or use a vectorized approach such as pandas. For example:
import pandas as pd
a =[{ 'id': "1", 'date': "2017-01-24" },{ 'id': "2", 'date': "2018-01-24" }]
b =[{ 'id': "1", 'name': "abc" },{ 'id': "2",'name': "xyz"}]
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
out = df1.merge(df2, on='id').to_dict('records')
result:
[{'id': '1', 'date': '2017-01-24', 'name': 'abc'}, {'id': '2', 'date': '2018-01-24', 'name': 'xyz'}]
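I haven't tested how this compares speed-wise to simply iterating. Iterating may be slow on large lists, but pandas also has to construct the DataFrames and convert the output back to dicts, so there's a tradeoff. Note that merge defaults to an inner join, so ids that appear in only one list (such as "3" and "4" in the question) are dropped; passing how='outer' keeps them and fills the missing fields with NaN.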
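For comparison, the plain-iteration alternative mentioned above could look like the following minimal sketch. It assumes every dict has an 'id' key and that, if both lists define the same field, the later value should win; the name merged is only illustrative. It keeps ids that appear in only one list, as the question asks.

```python
from collections import defaultdict

a = [{'id': "1", 'date': "2017-01-24"},
     {'id': "2", 'date': "2018-01-24"},
     {'id': "3", 'date': "2019-01-24"}]
b = [{'id': "1", 'name': "abc"},
     {'id': "2", 'name': "xyz"},
     {'id': "4", 'name': "ijk"}]

# Group the fields of every record under its id; later records update earlier ones.
merged = defaultdict(dict)
for record in a + b:
    merged[record['id']].update(record)

# Dicts keep insertion order (Python 3.7+), so ids come out in order of first appearance.
result = list(merged.values())
print(result)
```

This is a single pass over both lists, so for small inputs it is likely cheaper than building two DataFrames, though that has not been measured here.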

Related

Python: Change a JSON value

Let's say I have the following JSON file named output.
{'fields': [{'name': 2, 'type': 'Int32'},
            {'name': 12, 'type': 'string'},
            {'name': 9, 'type': 'datetimeoffset'}],
 'type': 'struct'}
If a type key has the value datetimeoffset, I would like to change it to dateTime, and if it has the value Int32, I would like to change it to integer. I have multiple values to replace in this way.
The expected output is
{'fields': [{'name': 2, 'type': 'integer'},
            {'name': 12, 'type': 'string'},
            {'name': 9, 'type': 'dateTime'}],
 'type': 'struct'}
Can anyone help with this in Python?
You can try this out:
substitute = {"Int32": "integer", "datetimeoffset": "dateTime"}
x = {'fields': [
{'name': 2, 'type': 'Int32'},
{'name': 12, 'type': 'string'},
{'name': 9, 'type': 'datetimeoffset'}
],'type': 'struct'}
for i in range(len(x['fields'])):
    if x['fields'][i]["type"] in substitute:
        x['fields'][i]['type'] = substitute[x['fields'][i]['type']]
print(x)
You can use the following code. Add the values you want to replace to the equivalences dict:
import json

doc = {
    'fields': [
        {'name': 2, 'type': 'Int32'},
        {'name': 12, 'type': 'string'},
        {'name': 9, 'type': 'datetimeoffset'},
    ],
    'type': 'struct'
}
equivalences = {"datetimeoffset": "dateTime", "Int32": "integer"}
# Replace values based on the equivalences dict
for field in doc["fields"]:
    if field["type"] in equivalences:
        field["type"] = equivalences[field["type"]]
# The variable is named doc so that it does not shadow the json module
print(json.dumps(doc, indent=2))
The output is:
{
  "fields": [
    {
      "name": 2,
      "type": "integer"
    },
    {
      "name": 12,
      "type": "string"
    },
    {
      "name": 9,
      "type": "dateTime"
    }
  ],
  "type": "struct"
}
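A simple but ugly way is to round-trip through a JSON string. Note that this replaces every occurrence of those substrings anywhere in the serialized data, not only in the type values:
import json

json_ = {'fields': [{'name': 2, 'type': 'Int32'},
                    {'name': 12, 'type': 'string'},
                    {'name': 9, 'type': 'datetimeoffset'}], 'type': 'struct'}
result = json.loads(json.dumps(json_).replace("datetimeoffset", "dateTime").replace("Int32", "integer"))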
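The loop-based answers above can be shortened a little with dict.get, which returns a fallback when the key is missing. This is only a sketch of the same idea; the names schema and substitute are illustrative, and it assumes every entry in 'fields' has a 'type' key.

```python
substitute = {"Int32": "integer", "datetimeoffset": "dateTime"}

schema = {
    'fields': [
        {'name': 2, 'type': 'Int32'},
        {'name': 12, 'type': 'string'},
        {'name': 9, 'type': 'datetimeoffset'},
    ],
    'type': 'struct',
}

for field in schema['fields']:
    # Use the mapped name if there is one, otherwise keep the current type.
    field['type'] = substitute.get(field['type'], field['type'])

print(schema)
```

Because only the 'type' value of each field is touched, a field whose name happens to contain "Int32" is left alone, unlike with the string-replace approach.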

How to extract group count from dictionary?

I need to count the groups of items that share the same 'id' and 'name'.
Input:
myd = {
"Items": [
{
"id": 1,
"name": "ABC",
"value": 666
},
{
"id": 1,
"name": "ABC",
"value": 89
},
{
"id": 2,
"name": "DEF",
"value": 111
},
{
"id": 3,
"name": "GHI",
"value": 111
}
]
}
Expected output:
The count of {'id':1, 'name': 'ABC' } is 2
The count of {'id':2, 'name': 'DEF' } is 1
The count of {'id':3, 'name': 'GHI' } is 1
I can get the total number of items with len(myd['Items']), but how do I get the count for each combination of id and name?
You can use collections.OrderedDict with ('id', 'name') tuples as keys. The OrderedDict then groups the dictionaries that share the same 'id' and 'name' values, in order of first appearance:
myd = {'Items': [
{'id':1, 'name': 'ABC', 'value': 666},
{'id':1, 'name': 'ABC', 'value': 89},
{'id':2, 'name': 'DEF', 'value': 111 },
{'id':3, 'name': 'GHI', 'value': 111 }]
}
from collections import OrderedDict
od = OrderedDict()
for d in myd['Items']:
    # Collect values in a list (not a set) so that duplicate values are still counted
    od.setdefault((d['id'], d['name']), []).append(d['value'])
for ks, v in od.items():
    print("The count of {{'id': {}, 'name': {}}} is {}".format(ks[0], ks[1], len(v)))
Output:
The count of {'id': 1, 'name': ABC} is 2
The count of {'id': 2, 'name': DEF} is 1
The count of {'id': 3, 'name': GHI} is 1
This is a good candidate for groupby and itemgetter usage:
from itertools import groupby
from operator import itemgetter
myd = {'Items': [
{'id': 1, 'name': 'ABC', 'value': 666},
{'id': 1, 'name': 'ABC', 'value': 89},
{'id': 2, 'name': 'DEF', 'value': 111},
{'id': 3, 'name': 'GHI', 'value': 111}]
}
grouper = itemgetter('id', 'name')
for i, v in groupby(sorted(myd['Items'], key=grouper), key=grouper):
    print(f"the count for {dict(id=i[0], name=i[1])} is {len(list(v))}")
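If only the counts are needed, collections.Counter with (id, name) tuples as keys is another option. The following is a minimal sketch, assuming every item has both keys; the name counts is illustrative.

```python
from collections import Counter

myd = {'Items': [
    {'id': 1, 'name': 'ABC', 'value': 666},
    {'id': 1, 'name': 'ABC', 'value': 89},
    {'id': 2, 'name': 'DEF', 'value': 111},
    {'id': 3, 'name': 'GHI', 'value': 111}]
}

# Count how many items share each (id, name) combination.
counts = Counter((item['id'], item['name']) for item in myd['Items'])

for (item_id, name), count in counts.items():
    print(f"The count of {{'id': {item_id}, 'name': {name!r}}} is {count}")
```

Unlike groupby, this does not require the items to be sorted first.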

Reading un-even JSON with Python

I have a program that takes a LARGE JSON file, reads through the structure, grabs everything where the key matches something, and stores a number of items from that structure in the database. The problem is that the structure is sometimes different when there is only one item, as follows:
"stats": {
"first": [
{
"name": "Name1",
"context": "open",
"number": "139"
},
{
"name": "Name2",
"context": "opener",
"number": "135"
}
],
"second": {
"name": "Name1",
"context": "opener",
"amount": "1.5",
"number": "-125"
},
"third": [
{
"name": "Name1",
"context": "open",
"amount": "8.5",
"number": "-110"
},
{
"name": "Name2",
"context": "open",
"amount": "9.0",
"number": "-120"
}
]
}
},
So, you'll notice that second only has one entry, so it's structured differently (a single object instead of a list). I've tried more conditionals than I can think of. How do I check if it's a single entry and move forward? This is probably REALLY simple, I'm just at a loss and not the best at Python data structures (admittedly).
What I'm doing afterwards is grabbing something like third[0]['name'] and putting it into a database, so I get an error when I try that on the second node. Also, in some nodes second WILL have more than one entry and in others it won't; it totally depends on the record.
I would first parse the JSON, and then update the dictionary you describe (the one with keys like "first", "second", etc.) as follows:
def repair_dict(d):
    for k in list(d):
        v = d[k]
        if not isinstance(v, list):
            d[k] = [v]
It thus repairs the data like:
>>> d = json.loads(data)
>>> d
{'stats': {'third': [{'context': 'open', 'name': 'Name1', 'number': '-110', 'amount': '8.5'}, {'context': 'open', 'name': 'Name2', 'number': '-120', 'amount': '9.0'}], 'second': {'context': 'opener', 'name': 'Name1', 'number': '-125', 'amount': '1.5'}, 'first': [{'context': 'open', 'name': 'Name1', 'number': '139'}, {'context': 'opener', 'name': 'Name2', 'number': '135'}]}}
>>> repair_dict(d['stats'])
>>> d
{'stats': {'third': [{'context': 'open', 'name': 'Name1', 'number': '-110', 'amount': '8.5'}, {'context': 'open', 'name': 'Name2', 'number': '-120', 'amount': '9.0'}], 'second': [{'context': 'opener', 'name': 'Name1', 'number': '-125', 'amount': '1.5'}], 'first': [{'context': 'open', 'name': 'Name1', 'number': '139'}, {'context': 'opener', 'name': 'Name2', 'number': '135'}]}}
Or when pretty printing:
>>> pprint.pprint(d)
{'stats': {'first': [{'context': 'open', 'name': 'Name1', 'number': '139'},
{'context': 'opener', 'name': 'Name2', 'number': '135'}],
'second': [{'amount': '1.5',
'context': 'opener',
'name': 'Name1',
'number': '-125'}],
'third': [{'amount': '8.5',
'context': 'open',
'name': 'Name1',
'number': '-110'},
{'amount': '9.0',
'context': 'open',
'name': 'Name2',
'number': '-120'}]}}
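If you would rather not modify the parsed data in place, the same check can be applied at read time. The following is only a sketch; as_list is a hypothetical helper name, and the sample data is a shortened copy of the structure from the question.

```python
def as_list(value):
    """Return value unchanged if it is already a list, otherwise wrap it in one."""
    return value if isinstance(value, list) else [value]


stats = {
    "first": [
        {"name": "Name1", "context": "open", "number": "139"},
        {"name": "Name2", "context": "opener", "number": "135"},
    ],
    "second": {"name": "Name1", "context": "opener", "amount": "1.5", "number": "-125"},
}

# Every node can now be iterated the same way, whether it held one entry or many.
for key, value in stats.items():
    for entry in as_list(value):
        print(key, entry["name"], entry["number"])
```

With this, as_list(stats["second"])[0]["name"] works the same way as as_list(stats["first"])[0]["name"], and the original dictionary is left untouched.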

How can I create aggregate expressions of this list of dicts?

I have a list of dictionaries that expresses periods+days for a class in a student information system. Here's the data I'd like to aggregate:
[
{
'period': {
'name': '1',
'sort_order': 1
},
'day': {
'name': 'A',
'sort_order': 1
}
},
{
'period': {
'name': '1',
'sort_order': 1
},
'day': {
'name': 'B',
'sort_order': 2
}
},
{
'period': {
'name': '1',
'sort_order': 1
},
'day': {
'name': 'C',
'sort_order': 1
}
},
{
'period': {
'name': '3',
'sort_order': 3
},
'day': {
'name': 'A',
'sort_order': 1
}
},
{
'period': {
'name': '3',
'sort_order': 3
},
'day': {
'name': 'B',
'sort_order': 2
}
},
{
'period': {
'name': '3',
'sort_order': 3
},
'day': {
'name': 'C',
'sort_order': 2
}
},
{
'period': {
'name': '4',
'sort_order': 4
},
'day': {
'name': 'D',
'sort_order': 3
}
}
]
The aggregated string I'd like the above to reduce to is 1,3(A-C) 4(D). Notice that objects that aren't "adjacent" (determined by the object's sort_order) to each other are delimited by , and "adjacent" records are delimited by a -.
EDIT
Let me try to elaborate on the aggregation process. Each "class meeting" object contains a period and day. There are usually ~5 periods per day, and the days alternate cyclically between A,B,C,D, etc. So if I have a class that occurs 1st period on an A day, we might express that as 1(A). If a class occurs on 1st and 2nd period on an A day, the raw form of that might be 1(A),2(A), but it can be shortened to 1-2(A).
Some classes might not be in "adjacent" periods or days. A class might occur on 1st period and 3rd period on an A day, so its short form would be 1,3(A). However, if that class were on 1st, 2nd, and 3rd period on an A day, it could be written as 1-3(A). This also applies to days, so if a class occurs on 1st,2nd, and 3rd period, on A,B, and C day, then we could write it 1-3(A-C).
Finally, if a class occurs on 1st,2nd, and 3rd period and on A,B, and C day, but also on 4th period on D day, its short form would be 1-3(A-C) 4(D).
What I've tried
The first step that occurs to me to perform is to "group" the meeting objects into related sub-lists with the following function:
def _to_related_lists(list):
    """Given a list of section meeting dicts, return a list of lists, where each sub-list is list of
    related section meetings, either related by period or day"""
    related_list = []
    sub_list = []
    related_values = set()
    for index, section_meeting_object in enumerate(list):
        # starting with empty values list
        if not related_values:
            related_values.add(section_meeting_object['period']['name'])
            related_values.add(section_meeting_object['day']['name'])
            sub_list.append(section_meeting_object)
        elif section_meeting_object['period']['name'] in related_values or section_meeting_object['day']['name'] in related_values:
            related_values.add(section_meeting_object['period']['name'])
            related_values.add(section_meeting_object['day']['name'])
            sub_list.append(section_meeting_object)
        else:
            # no related values found in current section_meeting_object
            related_list.append(sub_list)
            sub_list = []
            related_values = set()
            related_values.add(section_meeting_object['period']['name'])
            related_values.add(section_meeting_object['day']['name'])
            sub_list.append(section_meeting_object)
    related_list.append(sub_list)
    return related_list
Which returns:
[
[{
'period': {
'sort_order': 1,
'name': '1'
},
'day': {
'sort_order': 1,
'name': 'A'
}
}, {
'period': {
'sort_order': 1,
'name': '1'
},
'day': {
'sort_order': 2,
'name': 'B'
}
}, {
'period': {
'sort_order': 2,
'name': '2'
},
'day': {
'sort_order': 1,
'name': 'A'
}
}, {
'period': {
'sort_order': 2,
'name': '2'
},
'day': {
'sort_order': 2,
'name': 'B'
}
}],
[{
'period': {
'sort_order': 4,
'name': '4'
},
'day': {
'sort_order': 3,
'name': 'C'
}
}]
]
If the entire string 1-3(A-C) 4(D) is the aggregate expression I'd like in the end, let's call 1-3(A-C) and 4(D) "sub-expressions". Each related sub-list would be a "sub-expression", so I was thinking I'd somehow iterate through every sub-list and create the sub-expression, but I'm not exactly sure how to do that.
First, let us define your list as d_list.
d_list = [
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'A'}},
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'C'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 1, 'name': 'A'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'C'}},
{'period': {'sort_order': 4, 'name': '4'}, 'day': {'sort_order': 3, 'name': 'D'}},
]
Note that I use Python's built-in string module to establish that B lies between A and C. What you may want to do is:
import string

# Collect the day names for each period name
agg0 = {}
for d in d_list:
    name = d['period']['name']
    if name not in agg0:
        agg0[name] = []
    day = d['day']
    agg0[name].append(day['name'])

# Keep only the first and last day (by alphabet position) for each period
agg1 = {}
for k, v in agg0.items():
    pos_in_alph = [string.ascii_lowercase.index(el.lower()) for el in v]
    allowed_indexes = [max(pos_in_alph), min(pos_in_alph)]
    agg1[k] = [el for el in v if string.ascii_lowercase.index(el.lower()) in allowed_indexes]

# Group the periods that share the same day range
agg = {}
for k, v in agg1.items():
    w = tuple(v)
    if w not in agg:
        agg[w] = {'ks': [], 'gr': len(agg0[k]) > 2}
    agg[w]['ks'].append(k)

str_ = ''
for k, v in sorted(agg.items(), key=lambda item: item[0], reverse=False):
    str_ += ' {pnames}({dnames})'.format(pnames=('-' if v['gr'] else ',').join(sorted(v['ks'])),
                                         dnames='-'.join(k))
print(str_.strip())
which outputs 1-3(A-C) 4(D)
Following @NathanJones's comment, note that if d_list were defined as
d_list = [
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'A'}},
##{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'C'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 1, 'name': 'A'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'C'}},
{'period': {'sort_order': 4, 'name': '4'}, 'day': {'sort_order': 3, 'name': 'D'}},
]
The code above would print 1,3(A-C) 4(D)
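The adjacency rule from the question (consecutive sort_order values joined with -, gaps joined with ,) can also be sketched directly on sort_order instead of alphabet position. This is not the code above but a separate, hedged sketch: compress and aggregate are hypothetical helper names, and the sample data assumes each day has one consistent sort_order (A=1, B=2, C=3, D=4), which is not true of the data as posted in the question.

```python
def compress(items):
    """Collapse (sort_order, name) pairs into a string such as '1-3' or '1,3'."""
    runs = []
    for order, name in sorted(set(items)):
        if runs and order == runs[-1][-1][0] + 1:
            runs[-1].append((order, name))   # extends the current run
        else:
            runs.append([(order, name)])     # gap: start a new run
    parts = []
    for run in runs:
        first, last = run[0][1], run[-1][1]
        parts.append(first if len(run) == 1 else f"{first}-{last}")
    return ",".join(parts)


def aggregate(meetings):
    """Build an expression such as '1,3(A-C) 4(D)' from meeting dicts."""
    days_by_period = {}
    for m in meetings:
        period = (m['period']['sort_order'], m['period']['name'])
        day = (m['day']['sort_order'], m['day']['name'])
        days_by_period.setdefault(period, set()).add(day)
    # Periods that meet on exactly the same set of days share a sub-expression.
    periods_by_days = {}
    for period, days in days_by_period.items():
        periods_by_days.setdefault(frozenset(days), []).append(period)
    groups = sorted(periods_by_days.items(), key=lambda kv: min(kv[1]))
    return " ".join(f"{compress(p)}({compress(d)})" for d, p in groups)


# Assumed, cleaned-up sample: periods 1 and 3 on days A-C, period 4 on day D.
day_order = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
sample = [
    {'period': {'name': p, 'sort_order': int(p)}, 'day': {'name': d, 'sort_order': day_order[d]}}
    for p, d in [('1', 'A'), ('1', 'B'), ('1', 'C'), ('3', 'A'), ('3', 'B'), ('3', 'C'), ('4', 'D')]
]
print(aggregate(sample))
```

Grouping periods by their exact set of days is a simplification; schedules where periods share only some of their days would need a more careful grouping step.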

Python sort a JSON list by two key values

I have a JSON list that looks like this:
[{ "id": "1", "score": "100" },
{ "id": "3", "score": "89" },
{ "id": "1", "score": "99" },
{ "id": "2", "score": "100" },
{ "id": "2", "score": "59" },
{ "id": "3", "score": "22" }]
I want to sort by id first, so I used
sorted_list = sorted(json_list, key=lambda k: int(k['id']), reverse = False)
This only sorts the list by id, but within each id I also want to sort by score. The final list I want looks like this:
[{ "id": "1", "score": "100" },
{ "id": "1", "score": "99" },
{ "id": "2", "score": "100" },
{ "id": "2", "score": "59" },
{ "id": "3", "score": "89" },
{ "id": "3", "score": "22" }]
So for each id, sort their score as well. Any idea how to do that?
Use a tuple as the key, adding -int(k["score"]) as a second sort key to reverse the order when breaking ties, and remove reverse=False:
sorted_list = sorted(json_list, key=lambda k: (int(k['id']),-int(k["score"])))
[{'score': '100', 'id': '1'},
{'score': '99', 'id': '1'},
{'score': '100', 'id': '2'},
{'score': '59', 'id': '2'},
{'score': '89', 'id': '3'},
{'score': '22', 'id': '3'}]
So we primarily sort by id from lowest to highest, but we break ties using score from highest to lowest. Before Python 3.7, dicts were also unordered, so there was no way to put id before score when printing without something like an OrderedDict; from Python 3.7 on, dicts preserve insertion order.
Or use pprint:
from pprint import pprint as pp
pp(sorted_list)
[{'id': '1', 'score': '100'},
{'id': '1', 'score': '99'},
{'id': '2', 'score': '100'},
{'id': '2', 'score': '59'},
{'id': '3', 'score': '89'},
{'id': '3', 'score': '22'}]
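Negating the key only works for numeric values. When a sort key cannot be negated (for example a string), the same result can be obtained with two passes, because Python's sort is stable. The following is a sketch of that alternative applied to the data from the question; the intermediate name by_score is illustrative.

```python
json_list = [{"id": "1", "score": "100"},
             {"id": "3", "score": "89"},
             {"id": "1", "score": "99"},
             {"id": "2", "score": "100"},
             {"id": "2", "score": "59"},
             {"id": "3", "score": "22"}]

# Pass 1: sort by the secondary key (score, highest first).
by_score = sorted(json_list, key=lambda k: int(k["score"]), reverse=True)

# Pass 2: sort by the primary key (id, lowest first). The sort is stable,
# so items with the same id keep the score order from pass 1.
sorted_list = sorted(by_score, key=lambda k: int(k["id"]))

for item in sorted_list:
    print(item)
```

For purely numeric keys the single-pass tuple key shown above is simpler; the two-pass form is mainly useful when the sort directions differ and negation is not available.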
