Merging list of python dictionaries by column value - python

I have data in the form of a list of Python dictionaries, each representing one row, and I want to combine several of these rows into one dictionary.
The rows should be combined by a common value in a single column. Note that the dictionaries to merge may or may not contain the same columns, and values should be concatenated, not clobbered.
Here is an example (combining dicts by the value in column 'a'):
data = [{ 'a':0, 'b':10, 'c':20 },
        { 'a':2, 'd':30, 'e':40 },
        { 'a':0, 'b':50, 'c':60 },
        { 'a':1, 'd':70, 'c':80 },
        { 'a':1, 'b':90, 'e':100 }]
Desired output is:
new_data = [{ 'a':0, 'b':[10,50], 'c':[20,60] },
            { 'a':1, 'd':[70], 'c':[80], 'b':[90], 'e':[100] },
            { 'a':2, 'd':[30], 'e':[40] }]
I have a simple function that can accomplish this, but need a faster method (Data has approx 1,000,000 rows and 20 columns). My method of finding the dictionaries I want to merge is very expensive.
Here is where I have an issue with computation time:
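unique_idx, locations = [], {}
for i, row in enumerate(data):
    _id = row['a']
    if _id not in unique_idx:  # membership test on a list scans it on every row
        unique_idx.append(_id)
        locations[_id] = [i]
    else:
        locations[_id].append(i)
# each entry of locations is a list of row indices, so look up one row per index
grouped_data = [[data[i] for i in locs] for locs in locations.values()]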
I need a faster method to collect dictionaries that contain the same value in one column. Ideally I want a quick method with plain python, but if this can be done simply with a pandas DataFrame that is good as well.
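One way to avoid the repeated list lookups is to group in a single pass with a dictionary keyed on the column value, since dictionary membership tests take constant time on average. The sketch below is plain Python and assumes every row contains the key column. The function name merge_rows_by_key is made up for illustration, and sorting the output by key is an assumption taken from the desired output above.

```python
from collections import defaultdict

def merge_rows_by_key(rows, key):
    """Merge rows sharing rows[key]; every other column becomes a list of values."""
    merged = {}
    for row in rows:
        # one dictionary lookup per row instead of scanning a list
        group = merged.setdefault(row[key], defaultdict(list))
        for column, value in row.items():
            if column != key:
                group[column].append(value)
    # sorted() assumes the key values are mutually comparable
    return [{key: k, **merged[k]} for k in sorted(merged)]

data = [{'a': 0, 'b': 10, 'c': 20},
        {'a': 2, 'd': 30, 'e': 40},
        {'a': 0, 'b': 50, 'c': 60},
        {'a': 1, 'd': 70, 'c': 80},
        {'a': 1, 'b': 90, 'e': 100}]

new_data = merge_rows_by_key(data, 'a')
```

Each row is visited once and each value is appended once, so the work grows linearly with the number of cells rather than with the number of rows times the number of unique ids.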

Related

Nested Json Using pyspark

We need to build nested JSON in PySpark using the structure below. I have also added the data that needs to be fed into it.
Input data structure:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
a1=["DA_STinf","DA_Stinf_NA","DA_Stinf_city","DA_Stinf_NA_ID","DA_Stinf_NA_ID_GRANT","DA_country"]
a2=["data.studentinfo","data.studentinfo.name","data.studentinfo.city","data.studentinfo.name.id","data.studentinfo.name.id.grant","data.country"]
columns = ["data","action"]
df = spark.createDataFrame(zip(a1, a2), columns)
#Input data for json structure
a1=["Pune"]
a2=["YES"]
a3=["India"]
col=["DA_Stinf_city","DA_Stinf_NA_ID_GRANT","DA_country"]
data=spark.createDataFrame(zip(a1, a2,a3), col)
Expected result based on above data
{
    "data": {
        "studentinfo": {
            "city": "Pune",
            "name": {
                "id": {
                    "grant": "YES"
                }
            }
        },
        "country": "India"
    }
}
We tried building this manually with F.struct, but we need a dynamic way to build the JSON from the df DataFrame, which has the data and action columns.
from pyspark.sql import functions as F

data.select(
    F.struct(
        F.struct(
            # leaf values are plain columns renamed to the JSON key
            F.col("DA_Stinf_city").alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT").alias("grant")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.col("DA_country").alias("country"),
    ).alias("data")
)
The approach below should give the correct structure, but with the wrong key names. It doesn't use DataFrame operations and instead works on the underlying RDD; if you are happy with that, I can flesh it out:
def build_json(pairs, running=None):
    # pairs: list of (hierarchy, value), where hierarchy is a list of keys
    if running is None:  # avoid sharing one mutable default dict between calls
        running = {}
    new_input = {}
    for hierarchy, value in pairs:
        key = hierarchy.pop(0)
        if len(hierarchy) == 0:
            running[key] = value
        else:
            new_input[key] = new_input.get(key, []) + [(hierarchy, value)]
    for key in new_input:
        running[key] = build_json(new_input[key])
    return running

data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)
The basic idea is to get a set of tuples from the underlying RDD consisting of the column name broken into its json hierarchy and the value to insert into the hierarchy. Then the function build_json inserts the value into its correct place in the json hierarchy, while building out the json object recursively.
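To get the key names from the question rather than pieces of the column names, one option is to look each column up in the data/action mapping and walk its dotted path. The sketch below is plain Python so it can be run without Spark. The helper name nest_by_path and the hand-written column_to_path dictionary are assumptions for illustration; in Spark the dictionary could presumably be built from df.collect(), and the function applied per row via row.asDict().

```python
import json

def nest_by_path(row, column_to_path):
    """Place each column's value at the dotted path mapped to that column."""
    nested = {}
    for column, value in row.items():
        *parents, leaf = column_to_path[column].split(".")
        node = nested
        for part in parents:
            # create intermediate objects on demand
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested

# Stand-in for the mapping DataFrame (columns "data" and "action")
column_to_path = {
    "DA_Stinf_city": "data.studentinfo.city",
    "DA_Stinf_NA_ID_GRANT": "data.studentinfo.name.id.grant",
    "DA_country": "data.country",
}

# Stand-in for one row of the input DataFrame, as returned by row.asDict()
row = {"DA_Stinf_city": "Pune", "DA_Stinf_NA_ID_GRANT": "YES", "DA_country": "India"}

result = nest_by_path(row, column_to_path)
print(json.dumps(result, indent=4))
```

Only the columns that carry values need a mapping entry, so the intermediate rows of the mapping (such as data.studentinfo) are not required here.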

How to flatten nested json that has dict in a column after using json normalize?

This is the flattened version of the column. I still need the keys as column titles for the dataframe and the values as values for the corresponding column.
reaction
{ "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
I want my data to look like this so I can add it to the dataframe.
veddra_term_code  veddra_version  veddra_term_name
99026             3               'Tablets, Abnormal'
Use f-strings. They're made for creating strings formatted the way you want:
d = { "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
s = f'veddra_term_code veddra_version veddra_term_name {d["veddra_term_code"]} {d["veddra_version"]} \'{d["veddra_term_name"]}\''
print(s) # prints veddra_term_code veddra_version veddra_term_name 99026 3 'Tablets, Abnormal'
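If the goal is real DataFrame columns rather than a formatted string, a possible alternative is to expand the dict column with pandas. This sketch assumes every cell of the reaction column holds a dict (not a JSON string or a missing value). The report_id column is hypothetical and only stands in for whatever other columns the dataframe has.

```python
import pandas as pd

df = pd.DataFrame({
    "report_id": [1],  # hypothetical extra column, not from the question
    "reaction": [{
        "veddra_term_code": "99026",
        "veddra_version": "3",
        "veddra_term_name": "Tablets, Abnormal",
    }],
})

# Turn each dict into a row; its keys become the column titles
expanded = pd.DataFrame(df["reaction"].tolist(), index=df.index)

# Swap the dict column for the expanded columns
flat = df.drop(columns="reaction").join(expanded)
print(flat)
```

Passing index=df.index keeps the expanded rows aligned with the original ones when the dataframe does not use a default index.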

Convert the dataframe to JSON Based on Column name

I have a dataframe like the one below (I am providing just one row):
Vessel_id,Org_id,Vessel_name,Good_Laden_Miles_Min,Good_Ballast_Miles_Min,Severe_Laden_Miles_Min,Severe_Ballast_Miles_Min
1,5,"ABC",10,15,25,35
I want to convert the dataframe to JSON in the format below:
{
    Vessel_id:1,
    Vessel_name:"ABC",
    Org_id:5,
    WeatherGood:{
        Good_Laden_Miles_Min:10,
        Good_Ballast_Miles_Min:15
    },
    WeatherSevere:{
        Severe_Laden_Miles_Min:25,
        Severe_Ballast_Miles_Min:35
    }
}
How can I group all the columns starting with Good into WeatherGood and convert the result to JSON?
You can first convert the dataframe to a dictionary of records, then transform each record to your desired format. Finally, convert the list of records to JSON.
import json

records = df.to_dict('records')
for record in records:
    record['WeatherGood'] = {
        k: record.pop(k) for k in ('Good_Laden_Miles_Min', 'Good_Ballast_Miles_Min')
    }
    record['WeatherSevere'] = {
        k: record.pop(k) for k in ('Severe_Laden_Miles_Min', 'Severe_Ballast_Miles_Min')
    }
>>> json.dumps(records)
'[{"Vessel_id": 1, "Org_id": 5, "Vessel_name": "ABC", "WeatherGood": {"Good_Laden_Miles_Min": 10, "Good_Ballast_Miles_Min": 15}, "WeatherSevere": {"Severe_Laden_Miles_Min": 25, "Severe_Ballast_Miles_Min": 35}}]'
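If the column names should not be listed by hand, the same regrouping can be driven by the column prefix, which is what the question asks for. The sketch below works on one plain record dict such as those produced by to_dict('records'). The function name group_by_prefix and the prefix-to-group mapping are assumptions for illustration.

```python
import json

def group_by_prefix(record, prefix_to_group):
    """Move columns whose name starts with a known prefix into a nested dict."""
    out = {}
    for column, value in record.items():
        for prefix, group in prefix_to_group.items():
            if column.startswith(prefix):
                out.setdefault(group, {})[column] = value
                break
        else:
            # no prefix matched: keep the column at the top level
            out[column] = value
    return out

record = {
    "Vessel_id": 1, "Org_id": 5, "Vessel_name": "ABC",
    "Good_Laden_Miles_Min": 10, "Good_Ballast_Miles_Min": 15,
    "Severe_Laden_Miles_Min": 25, "Severe_Ballast_Miles_Min": 35,
}

grouped = group_by_prefix(record, {"Good_": "WeatherGood", "Severe_": "WeatherSevere"})
print(json.dumps(grouped))
```

New Good_ or Severe_ columns are picked up without further changes, and another weather class only needs one more entry in the mapping.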

Convert python nested JSON-like data to dataframe

My records look like this, and I need to write them to a CSV file:
my_data={"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}
This looks like JSON, but the next record also starts with "data" and not "data1", which forces me to read each record separately. I convert each record to a dict using eval(), then iterate through the keys and values along a certain path to reach the values I need. From those I build a list of keys and values, and pd.DataFrame() converts that list into a dataframe, which I know how to convert to CSV. My working code is below, but I am sure there are better ways to do this. Mine scales poorly. Thx.
counter=1
k=[]
v=[]
res=[]
m=0
for line in f2:
    jline=eval(line)
    counter +=1
    for items in jline:
        k.append(jline[u'data'][0].keys())
        v.append(jline[u'data'][0].values())
    print 'keys are:', k
    i=0
    j=0
    while i <3 :
        while j <3:
            if k[i][j]==u'id':
                res.append(v[i][j])
            j += 1
        i += 1
    #res is my result set
    del k[:]
    del v[:]
Changing my_data to be:
my_data = [{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data One
{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data Two
{"id":"xyz3","type":"book","attributes":{"doc_type":"article","action":"cut"}}] # Data Three
You can dump this directly into a dataframe as so:
mydf = pd.DataFrame(my_data)
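To get from one-record-per-line input to the list shape above without eval(), each line can be parsed with json.loads and the nested attributes lifted to the top level. The sketch below uses only the standard library and writes the CSV to an in-memory buffer. It assumes every line is valid JSON with a "data" list, and the sample lines and the flatten_record helper are made up for illustration.

```python
import csv
import io
import json

# Stand-in for the lines of the input file
lines = [
    '{"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}',
    '{"data":[{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}',
]

def flatten_record(record):
    """Copy the top-level fields and merge the nested attributes into them."""
    flat = {key: value for key, value in record.items() if key != "attributes"}
    flat.update(record.get("attributes", {}))
    return flat

rows = [flatten_record(item) for line in lines for item in json.loads(line)["data"]]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "type", "doc_type", "action"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The same rows list can also be handed to pd.DataFrame(rows) if the dataframe is needed for other steps.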
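It's not clear what your data path would be, but if you are looking for specific combinations of id, type, etc., you could search explicitly:
def find_my_way(data, pattern):
    # pattern = {'id': 'someid', 'type': 'sometype', ...}
    res = []
    for row in data:
        # keep rows whose values match every key given in pattern
        if all(row.get(key) == value for key, value in pattern.items()):
            res.append(row)
    return res  # return all matches, not just the last row visited

pattern = {'id': 'xyz', 'type': 'book'}  # example pattern
mydf = pd.DataFrame(find_my_way(my_data, pattern))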
EDIT:
Without going into how the api works, in pseudo-code, you'll want to do something like the following:
my_objects = []
calls = 0
while calls < maximum:
    my_data = call_the_api(params)
    calls += 1  # count every call so the loop always terminates
    data = my_data.get('data')
    if not data:
        continue
    # Api calls to single objects usually return a dictionary, to group objects they return lists. This handles both cases
    if isinstance(data, list):
        my_objects = [*data, *my_objects]
    elif isinstance(data, dict):
        my_objects = [{**data}, *my_objects]
# This will unpack the data response into a list that you can then load into a DataFrame with the attributes from the api as the columns
df = pd.DataFrame(my_objects)
Assuming your data from the api looks like:
"""
{
    "links": {},
    "meta": {},
    "data": {
        "type": "FactivaOrganizationsProfile",
        "id": "Goog",
        "attributes": {
            "key_executives": {
                "source_provider": [
                    {
                        "code": "FACSET",
                        "descriptor": "FactSet Research Systems Inc.",
                        "primary": true
                    }
                ]
            }
        },
        "relationships": {
            "people": {
                "data": {
                    "type": "people",
                    "id": "39961704"
                }
            }
        }
    },
    "included": {}
}
"""
per the documentation, which is why I'm using my_data.get('data').
That should get all of the data (unfiltered) into a DataFrame.
Creating the DataFrame once at the end, rather than on every call, is a bit more memory friendly.
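The collection step from the pseudo-code can be tried without a real API by feeding it a few canned responses. In the sketch below the responses list is a hand-made stand-in trimmed from the sample payload, and collect_objects is a hypothetical helper name. Unlike the pseudo-code, it appends new objects to the end of the list instead of prepending them.

```python
def collect_objects(responses):
    """Gather the 'data' member of each response into one flat list of dicts."""
    objects = []
    for response in responses:
        data = response.get("data")
        if not data:
            continue  # nothing usable in this response
        if isinstance(data, list):
            objects.extend(data)   # grouped objects arrive as a list
        elif isinstance(data, dict):
            objects.append(data)   # a single object arrives as a dict
    return objects

# Canned responses standing in for successive API calls
responses = [
    {"links": {}, "data": {"type": "FactivaOrganizationsProfile", "id": "Goog"}},
    {"links": {}, "data": [{"type": "people", "id": "39961704"}]},
    {"links": {}, "data": None},
]

objects = collect_objects(responses)
print(objects)
```

The resulting list can then be passed to pd.DataFrame(objects) once, after all calls have finished.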

Using json_normalize for structured multi level dictionaries with lists

I've successfully transferred the data from a JSON file (structured as per the below example), into a three column ['tag', 'time', 'score'] DataFrame using the following iterative approach:
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        df.loc[len(df)] = [v['tag_id'], v1['time'], v1['value']]
However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure. I'm assuming that an iterative approach is not the ideal way to tackle this sort of problem. Using pandas.io.json.json_normalize instead, I've tried the following:
result = json_normalize(my_request, ['content'], ['data', 'score', ['time', 'value']])
Which returns KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('data',)). I believe I've misinterpreted the pandas documentation on json_normalize, and can't quite figure out how I should pass the parameters.
Can anyone point me in the right direction?
(alternatively using errors='ignore' returns ValueError: Conflicting metadata name data, need distinguishing prefix.)
JSON Structure
{
    'content':[
        {
            'data':{
                'score':[
                    {
                        'time':'2015-03-01 00:00:30',
                        'value':75.0
                    },
                    {
                        'time':'2015-03-01 23:50:30',
                        'value':58.0
                    }
                ]
            },
            'tag_id':320676
        },
        {
            'data':{
                'score':[
                    {
                        'time':'2015-03-01 00:00:25',
                        'value':78.0
                    },
                    {
                        'time':'2015-03-01 00:05:25',
                        'value':57.0
                    }
                ]
            },
            'tag_id':320677
        }
    ],
    'meta':None,
    'requested':'2018-04-15 13:00:00'
}
"However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure."
I would suggest the following:
1. Check whether the problem is with your iterated appends. Pandas is not very good at sequentially adding rows. How about this code:
import pandas as pd

tups = []
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        tups.append((v['tag_id'], v1['time'], v1['value']))  # one tuple per row
df = pd.DataFrame(tups, columns=['tag_id', 'time', 'value'])
2. If the preceding is not fast enough, check whether the JSON-parsing part is the bottleneck by running the loop on its own:
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        v['tag_id'], v1['time'], v1['value']
It is probable that step 1 will be fast enough. If not, however, check whether ujson might be faster for this case.
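For the json_normalize call itself, the arguments in the question appear to be the wrong way round: the nested list to expand goes in record_path, and the per-parent fields go in meta. The sketch below assumes pandas 1.0 or later, where the function is available as pd.json_normalize (older versions import it from pandas.io.json). The rename to tag and score is only there to match the three column names from the question.

```python
import pandas as pd

my_request = {
    "content": [
        {"data": {"score": [{"time": "2015-03-01 00:00:30", "value": 75.0},
                            {"time": "2015-03-01 23:50:30", "value": 58.0}]},
         "tag_id": 320676},
        {"data": {"score": [{"time": "2015-03-01 00:00:25", "value": 78.0},
                            {"time": "2015-03-01 00:05:25", "value": 57.0}]},
         "tag_id": 320677},
    ],
    "meta": None,
    "requested": "2018-04-15 13:00:00",
}

# Each element of content is one parent record; data -> score is the list to expand
df = pd.json_normalize(
    my_request["content"],
    record_path=["data", "score"],
    meta=["tag_id"],
)

# Match the column names and order used in the question
df = df.rename(columns={"tag_id": "tag", "value": "score"})[["tag", "time", "score"]]
print(df)
```

Passing my_request["content"] rather than my_request keeps the top-level meta and requested keys out of the way, which sidesteps the conflicting-metadata error mentioned in the question.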
