Creating json for special values in the dict of dicts python - python

I have a dictionary like this:
import json
dic={'61': {'NAME': 'John', 'LASTNAME': 'X', 'EMAIL': 'X#example.com', 'GRADE': '99'}, '52': {'NAME': 'Jennifer', 'LASTNAME': 'Y', 'EMAIL': 'Y#example.com', 'GRADE': '98'}}
obj = json.dumps(dic,indent=3)
print(obj)
I want to create JSON containing only some of the values:
{
"NAME": "John",
"LASTNAME": "X",
,
"NAME": "Jennifer",
"LASTNAME": "Y"
}
Any ideas on how to do this?

If I understand correctly, you want to keep the values of your original data without the indices and also filter out some of the fields (keep only "NAME" and "LASTNAME"). You can do so by combining a dictionary comprehension with a list comprehension:
array = [{k: v for k, v in d.items() if k in ("NAME", "LASTNAME")} for d in dic.values()]
This creates the following output:
>>> array
[{'NAME': 'John', 'LASTNAME': 'X'}, {'NAME': 'Jennifer', 'LASTNAME': 'Y'}]
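Since the question asks for JSON rather than a Python list, here is a minimal sketch of the remaining step, assuming the filtered list should be serialised as a JSON array. The `wanted` tuple is a name introduced only for this illustration.

```python
import json

dic = {
    '61': {'NAME': 'John', 'LASTNAME': 'X', 'EMAIL': 'X#example.com', 'GRADE': '99'},
    '52': {'NAME': 'Jennifer', 'LASTNAME': 'Y', 'EMAIL': 'Y#example.com', 'GRADE': '98'},
}

# Fields to keep; "wanted" is a name chosen for this sketch only.
wanted = ("NAME", "LASTNAME")

# Drop the outer keys and keep only the wanted fields of each inner dict.
array = [{k: v for k, v in d.items() if k in wanted} for d in dic.values()]

# Serialise the filtered list as a JSON array.
obj = json.dumps(array, indent=3)
print(obj)
```

A JSON object cannot repeat the "NAME" key, so a list of objects is probably the closest valid form of the output shown in the question.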

Related

How to efficiently iterate over a very large list in pyspark

I have a sample data frame below:
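+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+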
Now I want to convert this into a JSON list like the below:
[{'firstname': 'James', 'middlename': '', 'lastname': 'Smith', 'id': '36636', 'gender': 'M', 'salary': 3000}, {'firstname': 'Michael', 'middlename': 'Rose', 'lastname': '', 'id': '40288', 'gender': 'M', 'salary': 4000}, {'firstname': 'Robert', 'middlename': '', 'lastname': 'Williams', 'id': '42114', 'gender': 'M', 'salary': 4000}, {'firstname': 'Maria', 'middlename': 'Anne', 'lastname': 'Jones', 'id': '39192', 'gender': 'F', 'salary': 4000}, {'firstname': 'Jen', 'middlename': 'Mary', 'lastname': 'Brown', 'id': '', 'gender': 'F', 'salary': -1}]
and I did that using the below code:
result = json.loads((df.toPandas().to_json(orient="records")))
Now I want to send the JSON records to the API one by one. I can't send all the records at once, and there are millions of records to send. How do I split up these records using Map() or some other way so that the work runs in a distributed fashion? Iterating over the list with a for loop works but takes a long time, so I want to implement the most efficient approach for this use case. The for loop code is below:
for i in result_json:
    try:
        token = get_token(tokenUrl, tokenBody)
        custRequestBody = {
            "Token": token,
            "CustomerName": "",
            "Object": "",
            "Data": [i]
        }
        #print("::::Customer Request Body::::::")
        #print(custRequestBody)
        response = call_to_cust_bulk_api(apiUrl, custRequestBody)
        output = {
            "headers": {
                "Content-Type": "",
                "X-Content-Type-Options": "",
                "X-XSS-Protection": "",
                "X-Frame-Options": "DENY",
                "Strict-Transport-Security": ""
            },
            "body": {
                "Response code": 200,
                "ResponseMessage": response
            }
        }
    except Exception as e:
        # The except clause is missing from the posted snippet; this handler is only a placeholder
        print(e)
Here, result_json is the list of JSON records produced by the conversion above.
You can perform a row-wise operation on your df using a udf (user-defined function).
Spark will run this function on all executors in a distributed fashion.
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
# Your custom function you want to run in pyspark
@udf(returnType=IntegerType())
def parse_and_post(*args):
    print(args, type(args)) # args is of type tuple
    # Convert the args tuple to json
    # Send the json to API
    # Return a status value based on API success or failure
    """if success:
        return 200
    else:
        return -1"""
df = spark.createDataFrame([(1, "John Doe", 21), (2, "Simple", 33)], ("id", "name", "age"))
# Apply the UDF to your Dataframe (called "df")
new_df = df.withColumn("post_status", parse_and_post( *[df[x] for x in df.columns] ))
Note
You might be tempted to call collect() on your df and then iterate over the rows, but that loads all the data into the driver, which defeats the purpose of distributed computation.
Also, the function will not be executed until you use or show new_df, because Spark evaluates lazily.
Read more about udf here
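The "convert the args tuple to json" step inside the UDF could look roughly like the plain-Python sketch below. It assumes the values arrive in the same order as df.columns and are all JSON-serialisable. The helper name `row_to_payload` is hypothetical, not part of PySpark.

```python
import json

def row_to_payload(columns, values):
    """Pair column names with one row's values and serialise the record as JSON.

    Hypothetical helper: inside the UDF, `values` would be the args tuple and
    `columns` would be the list of DataFrame column names (df.columns).
    """
    record = dict(zip(columns, values))
    return json.dumps(record)

# Example with the same columns and first row as the sample DataFrame above.
columns = ["id", "name", "age"]
payload = row_to_payload(columns, (1, "John Doe", 21))
print(payload)
```

The returned string could then be passed to whatever function posts to the API. Values such as dates or decimals would need extra handling before json.dumps accepts them.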

List of Dictionary - How to combine a list of dictionary Python

a =[{
"id":"1",
"Name":'BK',
"Age":'56'
},
{
"id":"1",
"Sex":'Male'
},
{
"id":"2",
"Name":"AK",
"Age":"32"
}]
I have a list of dictionaries in which one person's information is split across multiple dictionaries, as above. For example, the information for id 1 is contained in the first two dictionaries. How can I get the output below?
{1: {'Name':'BK','Age':'56','Sex':'Male'}, 2: { 'Name': 'AK','Age':'32'}}
You can use a defaultdict to collect the results.
from collections import defaultdict
a =[{ "id":"1", "Name":'BK', "Age":'56' }, { "id":"1", "Sex":'Male' }, { "id":"2", "Name":"AK", "Age":"32" }]
results = defaultdict(dict)
for a_dict in a:
    # pop() removes 'id' from each dict (mutating the input) and returns its value
    results[a_dict.pop('id')].update(a_dict)
This gives you:
>>> results
defaultdict(<class 'dict'>, {'1': {'Name': 'BK', 'Age': '56', 'Sex': 'Male'}, '2': {'Name': 'AK', 'Age': '32'}})
The defaultdict type behaves like a normal dict, except that when you reference an unknown key, a default value is created, stored under that key, and returned. This means that as the dicts in a are iterated over, the values (except for id) are merged into either an existing dict or an automatically created new one.
How does collections.defaultdict work?
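The approach above yields string keys and removes "id" from the input dicts, whereas the desired output in the question uses integer keys. A minimal sketch of a variant follows, assuming every "id" value can be converted with int() and that the input list should be left untouched. The names `merged` and `result` are chosen for this illustration.

```python
from collections import defaultdict

a = [
    {"id": "1", "Name": "BK", "Age": "56"},
    {"id": "1", "Sex": "Male"},
    {"id": "2", "Name": "AK", "Age": "32"},
]

merged = defaultdict(dict)
for record in a:
    # Assumption: every id is a numeric string, so int() succeeds.
    key = int(record["id"])
    # Copy every field except "id" instead of popping it, so `a` is not mutated.
    merged[key].update({k: v for k, v in record.items() if k != "id"})

# Convert to a plain dict so it prints without the defaultdict wrapper.
result = dict(merged)
print(result)
```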
Using defaultdict
from collections import defaultdict
a = [{
"id": "1",
"Name": 'BK',
"Age": '56'
},
{
"id": "1",
"Sex": 'Male'
},
{
"id": "2",
"Name": "AK",
"Age": "32"
}
]
final_ = defaultdict(dict)
for row in a:
    final_[row.pop('id')].update(row)
print(final_)
defaultdict(<class 'dict'>, {'1': {'Name': 'BK', 'Age': '56', 'Sex': 'Male'}, '2': {'Name': 'AK', 'Age': '32'}})
You can combine 2 dictionaries by using the .update() function
dict_a = { "id":"1", "Name":'BK', "Age":'56' }
dict_b = { "id":"1", "Sex":'Male' }
dict_a.update(dict_b) # {'Age': '56', 'Name': 'BK', 'Sex': 'Male', 'id': '1'}
Since the output you want is in dictionary form:
combined_dict = {}
for item in a:
    person_id = item.pop("id") # pop() removes the id key from item and returns the value
    if person_id in combined_dict:
        combined_dict[person_id].update(item)
    else:
        combined_dict[person_id] = item
print(combined_dict) # {'1': {'Name': 'BK', 'Age': '56', 'Sex': 'Male'}, '2': {'Name': 'AK', 'Age': '32'}}
from collections import defaultdict
result = defaultdict(dict)
a =[{ "id":"1", "Name":'BK', "Age":'56' }, { "id":"1", "Sex":'Male' }, { "id":"2", "Name":"AK", "Age":"32" }]
for b in a:
    result[b['id']].update(b)
print(result)
d = {}
for p in a:
    id = p["id"]
    if id not in d:
        d[id] = p
    else:
        d[id] = {**d[id], **p}
d is the result dictionary you want.
In the for loop, if you encounter an id for the first time, you just store the incomplete value.
If the id is already among the existing keys, you update its entry.
The combination happens in {**d[id], **p},
where ** unpacks a dict.
It unpacks the existing incomplete dict associated with the id and the current dict, then combines them into a new dict.

Creating a dictionary from two lists in python

I have JSON data as below.
input_list = [["Richard",[],{"children":"yes","divorced":"no","occupation":"analyst"}],
["Mary",["testing"],{"children":"no","divorced":"yes","occupation":"QA analyst","location":"Seattle"}]]
I have another list where I have the prospective keys present
list_keys = ['name', 'current_project', 'details']
I am trying to create a dict from both lists to make the data usable for metrics.
I have shortened both lists for the question; the real ones are much longer. input_list is a nested list with 500k+ elements, and each of those elements has 70+ elements of its own (except the details one).
list_keys also has 70+ elements in it.
I tried to create a dict using zip, but that is not helping given the size of the data. Also, with zip I am not able to exclude the "details" element from the result.
I am expecting output something like this.
[
{
"name": "Richard",
"current_project": "",
"children": "yes",
"divorced": "no",
"occupation": "analyst"
},
{
"name": "Mary",
"current_project" :"testing",
"children": "no",
"divorced": "yes",
"occupation": "QA analyst",
"location": "Seattle"
}
]
I have tried this so far
>>> for line in input_list:
...     zipbObj = zip(list_keys, line)
...     dictOfWords = dict(zipbObj)
...
>>> print dictOfWords
{'current_project': ['testing'], 'name': 'Mary', 'details': {'location': 'Seattle', 'children': 'no', 'divorced': 'yes', 'occupation': 'QA analyst'}}
But with this I am unable to get rid of the nested dict key "details", so I am looking for help with that.
It seems like what you want is a list of dictionaries. Here is something I coded up in the terminal and copied in here. Hope it helps.
>>> list_of_dicts = []
>>> for item in input_list:
...     record = {}  # named record to avoid shadowing the built-in dict
...     for i in range(0, len(item)-2, 3):
...         record[list_keys[0]] = item[i]
...         record[list_keys[1]] = item[i+1]
...         record.update(item[i+2])
...     list_of_dicts.append(record)
...
>>> list_of_dicts
[{'name': 'Richard', 'current_project': [], 'children': 'yes', 'divorced': 'no', 'occupation': 'analyst'}, {'name': 'Mary', 'current_project': ['testing'], 'children': 'no', 'divorced': 'yes', 'occupation': 'QA analyst', 'location': 'Seattle'}]
I will mention it is not the ideal method of doing this since it relies on perfectly ordered items in the input_list.
people = input_list = [["Richard",[],{"children":"yes","divorced":"no","occupation":"analyst"}],
["Mary",["testing"],{"children":"no","divorced":"yes","occupation":"QA analyst","location":"Seattle"}]]
list_keys = ['name', 'current_project', 'details']
listout = []
for person in people:
    dict_p = {}
    for key in list_keys:
        if key != 'details':
            dict_p[key] = person[list_keys.index(key)]
        else:
            subdict = person[list_keys.index(key)]
            for subkey in subdict.keys():
                dict_p[subkey] = subdict[subkey]
    listout.append(dict_p)
listout
The issue with using zip is that you have that additional dictionary in the people list. This will get the following output, and should work through a larger list of individuals:
[{'name': 'Richard',
'current_project': [],
'children': 'yes',
'divorced': 'no',
'occupation': 'analyst'},
{'name': 'Mary',
'current_project': ['testing'],
'children': 'no',
'divorced': 'yes',
'occupation': 'QA analyst',
'location': 'Seattle'}]
This script goes through every item of input_list and creates a new list of flat dictionaries that contain no nested lists or dictionaries:
input_list = [
["Richard",[],{"children":"yes","divorced":"no","occupation":"analyst"}],
["Mary",["testing"],{"children":"no","divorced":"yes","occupation":"QA analyst","location":"Seattle"}]
]
list_keys = ['name', 'current_project', 'details']
out = []
for item in input_list:
    d = {}
    out.append(d)
    for value, keyname in zip(item, list_keys):
        if isinstance(value, dict):
            d.update(**value)
        elif isinstance(value, list):
            if value:
                d[keyname] = value[0]
            else:
                d[keyname] = ''
        else:
            d[keyname] = value
from pprint import pprint
pprint(out)
Prints:
[{'children': 'yes',
'current_project': '',
'divorced': 'no',
'name': 'Richard',
'occupation': 'analyst'},
{'children': 'no',
'current_project': 'testing',
'divorced': 'yes',
'location': 'Seattle',
'name': 'Mary',
'occupation': 'QA analyst'}]
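With 500k+ rows and 70+ keys per row, it may help to flatten lazily instead of building every dict up front. The sketch below pairs keys with values using zip and merges only the nested "details" dict. The function name `flatten_rows` and its `nested_key` parameter are hypothetical names for this illustration, and lists such as current_project are left unchanged.

```python
def flatten_rows(rows, keys, nested_key="details"):
    """Yield one flat dict per row; the dict stored under nested_key is merged in."""
    for row in rows:
        flat = {}
        for key, value in zip(keys, row):
            if key == nested_key and isinstance(value, dict):
                # Merge the nested dict's items instead of keeping the "details" key.
                flat.update(value)
            else:
                flat[key] = value
        yield flat

input_list = [
    ["Richard", [], {"children": "yes", "divorced": "no", "occupation": "analyst"}],
    ["Mary", ["testing"], {"children": "no", "divorced": "yes",
                           "occupation": "QA analyst", "location": "Seattle"}],
]
list_keys = ["name", "current_project", "details"]

# The generator can be consumed row by row; list() is used here only for the demo.
flattened = list(flatten_rows(input_list, list_keys))
print(flattened)
```

Because flatten_rows is a generator, each flat dict can be processed or written out as it is produced, which keeps memory use bounded for large inputs.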

Filter/group dictionary by nested value

Here's a simplified example of some data I have:
{"id": "1234565", "fields": {"name": "john", "email":"john#example.com", "country": "uk"}}
The whole nested dictionary is part of a bigger list of address data. The goal is to create pairs of people from the list with randomized partners, where partners from the same country should be preferred. So my first real issue is to find a good way to group them by that country value.
I'm sure there's a smarter way to do this than iterating through the dict and writing all records out to some new list/dict?
I think this is close to what you need:
import itertools
result = {key: list(value) for key, value in itertools.groupby(people, lambda item: item["fields"]["country"])}
What this does is use itertools.groupby to group all people in the people list by their specified country. The resulting dictionary has countries as keys, and the unpacked groupings (matching people) as values. Input is expected as a list of dictionaries like the one in your example:
people = [{"id": "1234565", "fields": {"name": "john", "email":"john#example.com", "country": "uk"}},
{"id": "654321", "fields": {"name": "sam", "email":"sam#example.com", "country": "uk"}}]
Sample output:
>>> print(result)
{'uk': [{'fields': {'name': 'john', 'email': 'john#example.com', 'country': 'uk'}, 'id': '1234565'}, {'fields': {'name': 'sam', 'email': 'sam#example.com', 'country': 'uk'}, 'id': '654321'}]}
For a cleaner result, the looping construct can be tweaked so that only the ID of each person is included in the result dict:
result = {key:[i["id"] for i in value] for key, value in itertools.groupby(people, lambda item: item["fields"]["country"])}
>>> print(result)
{'uk': ['1234565', '654321']}
EDIT: Sorry, I forgot about the sorting. itertools.groupby only groups consecutive items with the same key, so sort the list of people by country before putting it through groupby. It should then work properly:
people = sorted(people, key=lambda item: item["fields"]["country"])
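Putting the sort and the grouping together, a self-contained sketch might look like this. The three sample records and the helper name `country_of` are made up for the illustration, and only the ID of each person is kept in the result.

```python
import itertools

# Made-up sample records in the same shape as the question's data.
people = [
    {"id": "1", "fields": {"name": "john", "country": "uk"}},
    {"id": "2", "fields": {"name": "ana", "country": "br"}},
    {"id": "3", "fields": {"name": "sam", "country": "uk"}},
]

def country_of(person):
    """Key function shared by sorted() and groupby()."""
    return person["fields"]["country"]

# groupby only merges adjacent items, so sort by the same key first.
people_sorted = sorted(people, key=country_of)

result = {
    country: [person["id"] for person in group]
    for country, group in itertools.groupby(people_sorted, key=country_of)
}
print(result)  # {'br': ['2'], 'uk': ['1', '3']}
```

Without the sort, the two "uk" records would form separate groups and the later group would overwrite the earlier one in the dict.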
Here is another one that uses defaultdict:
import collections
def make_groups(nested_dicts, nested_key):
    default = collections.defaultdict(list)
    for nested_dict in nested_dicts:
        for value in nested_dict.values():
            try:
                default[value[nested_key]].append(nested_dict)
            except TypeError:
                pass
    return default
To test the results:
import random
# A tuple rather than a set: random.sample() no longer accepts sets in recent Python versions
COUNTRY = ('af', 'br', 'fr', 'mx', 'uk')
people = [{'id': i, 'fields': {
              'name': 'name'+str(i),
              'email': str(i)+'#email',
              'country': random.choice(COUNTRY)}}
          for i in range(10)]
country_groups = make_groups(people, 'country')
for country, persons in country_groups.items():
    print(country, persons)
Random output:
fr [{'id': 0, 'fields': {'name': 'name0', 'email': '0#email', 'country': 'fr'}}, {'id': 1, 'fields': {'name': 'name1', 'email': '1#email', 'country': 'fr'}}, {'id': 4, 'fields': {'name': 'name4', 'email': '4#email', 'country': 'fr'}}]
br [{'id': 2, 'fields': {'name': 'name2', 'email': '2#email', 'country': 'br'}}, {'id': 8, 'fields': {'name': 'name8', 'email': '8#email', 'country': 'br'}}]
uk [{'id': 3, 'fields': {'name': 'name3', 'email': '3#email', 'country': 'uk'}}, {'id': 7, 'fields': {'name': 'name7', 'email': '7#email', 'country': 'uk'}}]
af [{'id': 5, 'fields': {'name': 'name5', 'email': '5#email', 'country': 'af'}}, {'id': 9, 'fields': {'name': 'name9', 'email': '9#email', 'country': 'af'}}]
mx [{'id': 6, 'fields': {'name': 'name6', 'email': '6#email', 'country': 'mx'}}]

Sorting a list of dictionaries by all keys being unique

Pulling my hair out with this one.
I have a list of dictionaries without a unique primary ID key for each unique entry (the dictionary is built on the fly):
dicts = [{'firstname': 'john', 'lastname': 'doe', 'code': 'crumpets'},
{'firstname': 'john', 'lastname': 'roe', 'code': 'roe'},
{'firstname': 'john', 'lastname': 'doe', 'code': 'crumpets'},
{'firstname': 'thom', 'lastname': 'doe', 'code': 'crumpets'},
]
How do I go about filtering out lists of dictionaries like this where any repeating {} within the list are removed? So I need to check if all three of the dictionary keys match up with another in the list...and then discard that from the dict if that check is met.
So, for my example above, the first and third "entries" need to be removed as they are duplicates.
You can create frozensets from the dicts and put those in a set to remove dupes:
dcts = [dict(d) for d in set(frozenset(d.items()) for d in dicts)]
print(dcts)
[{'code': 'roe', 'firstname': 'john', 'lastname': 'roe'},
{'code': 'crumpets', 'firstname': 'thom', 'lastname': 'doe'},
{'code': 'crumpets', 'firstname': 'john', 'lastname': 'doe'}]
If you want to remove every entry that has a duplicate, you can use a Counter:
from collections import Counter
dcts = [dict(d) for d, cnt in Counter(frozenset(d.items()) for d in dicts).items()
        if cnt==1]
print(dcts)
[{'code': 'roe', 'firstname': 'john', 'lastname': 'roe'},
{'code': 'crumpets', 'firstname': 'thom', 'lastname': 'doe'}]
Removing duplicates from a list of non-hashable elements requires you to make them hashable on the fly:
def remove_duplicated_dicts(elements):
    seen = set()
    result = []
    for element in elements:
        # Note: the tuple depends on key order, so dicts are treated as equal only if their keys were inserted in the same order
        element_as_tuple = tuple(element.items())
        if element_as_tuple not in seen:
            seen.add(element_as_tuple)
            result.append(element)
    return result
d = [{'firstname': 'john', 'lastname': 'doe', 'code': "crumpets"},
{'firstname': 'john', 'lastname': 'roe', 'code': "roe"},
{'firstname': 'john', 'lastname': 'doe', 'code': "crumpets"},
{'firstname': 'thom', 'lastname': 'doe', 'code': "crumpets"},
]
print(remove_duplicated_dicts(d))
PS.
Non-obvious differences with the accepted answer of Moses Koledoye (as of 2017-06-19 at 13:00:00):
preservation of the original list order;
faster conversions: dict -> tuple instead of dict -> frozenset -> dict (take it with a grain of salt: I have not benchmarked it).
Given that the values of the dictionary are hashable, we can write our own uniqueness filter:
def uniq(iterable, key=lambda x: x):
    keys = set()
    for item in iterable:
        ky = key(item)
        if ky not in keys:
            yield item
            keys.add(ky)
We can then simply use the filter, like:
list(uniq(dicts,key=lambda x:(x['firstname'],x['lastname'],x['code'])))
The filter maintains the original order, and will - for this example - generate:
>>> list(uniq(dicts,key=lambda x:(x['firstname'],x['lastname'],x['code'])))
[{'code': 'crumpets', 'firstname': 'john', 'lastname': 'doe'},
{'code': 'roe', 'firstname': 'john', 'lastname': 'roe'},
{'code': 'crumpets', 'firstname': 'thom', 'lastname': 'doe'}]
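If listing every key by hand is inconvenient, the same filter can be combined with the frozenset idea from the earlier answer. This is a sketch under the assumption that all dictionary values are hashable. The name `unique` is chosen for this illustration.

```python
def uniq(iterable, key=lambda x: x):
    """Yield items whose key has not been seen before, preserving input order."""
    seen = set()
    for item in iterable:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item

dicts = [
    {'firstname': 'john', 'lastname': 'doe', 'code': 'crumpets'},
    {'firstname': 'john', 'lastname': 'roe', 'code': 'roe'},
    {'firstname': 'john', 'lastname': 'doe', 'code': 'crumpets'},
    {'firstname': 'thom', 'lastname': 'doe', 'code': 'crumpets'},
]

# frozenset(d.items()) is hashable and ignores key order, so it works for any set of keys.
unique = list(uniq(dicts, key=lambda d: frozenset(d.items())))
print(unique)
```

This keeps the first occurrence of each dictionary and preserves the original order.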
