Goal: read data from a SQL table in which one column contains JSON (arrays), extract certain keys/values from the JSON into new columns, and then write the result to a new table. One of the joys of the original data format is that some records are JSON arrays and some are not arrays (just a single JSON object). Thus we may start with:
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
(2, {'a': 30, 'b': 40}),
(3, {'a': 100, 'b': 200, 'd': 300})]
for x in testcase:
    print(x)
(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}])
(2, {'a': 30, 'b': 40})
(3, {'a': 100, 'b': 200, 'd': 300})
Note the first element of each tuple is the record id. The first record is an array of length two, the second and third records are not arrays. The desired output is (as a DataFrame):
a b data
1 1 2 '{"c": 3}'
1 11 12 '{"c": 13}'
2 30 40 '{}'
3 100 200 '{"d": 300}'
Here you can see I've extracted keys 'a' and 'b' from the dicts into new columns, leaving the remaining keys/values in situ. The empty dict for id=2 is desirable behaviour.
First, I extracted the id and the data into separate lists. I take this opportunity to make the dict into a list of dicts (of length 1) so the types are now consistent:
id = [x[0] for x in testcase]
data_col = [x[1] if isinstance(x[1], list) else [x[1]] for x in testcase]
for x in data_col:
    print(x)
[{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]
[{'a': 30, 'b': 40}]
[{'a': 100, 'b': 200, 'd': 300}]
It feels a bit of a clunky extra step to have to extract id and data_col as separate lists, although at least we have the nice property that we're not copying data:
id[0] is testcase[0][0]
True
data_col[0] is testcase[0][1]
True
And, as I say, I had to deal with the issue that some records contained arrays of dicts and some just dicts, so this makes them all consistent.
The main nitty gritty happens here, where I perform a dict comprehension in a double list comprehension to iterate over each dict:
popped = [(id, {key: element.pop(key, None) for key in ['a', 'b']})
          for id, row in zip(id, data_col) for element in row]
for x in popped:
    print(x)
(1, {'a': 1, 'b': 2})
(1, {'a': 11, 'b': 12})
(2, {'a': 30, 'b': 40})
(3, {'a': 100, 'b': 200})
I need to be able to relate each new row with its original id, and the above achieves that, correctly reproducing the appropriate id value (1, 1, 2, 3). With a bit of housekeeping, I can then get all my target rows lined up:
import pandas as pd
from psycopg2.extras import Json
id2 = [x[0] for x in popped]
cols = [x[1] for x in popped]
data = [Json(item) for sublist in data_col for item in sublist]
popped_df = pd.DataFrame(cols, index=id2)
popped_df['data'] = data
And this gives me the desired DataFrame as shown above. But ... is all my messing about with lists necessary? I couldn't do a simple json_normalize because I don't want to extract all keys and it falls over with the combination of arrays and non-arrays.
It also needs to be as performant as possible as it's going to be processing multi-millions of rows. For this reason, I actually convert the DataFrame to a list using:
list(popped_df.itertuples())
to then pass to psycopg2.extras' execute_values()
so I may yet not bother constructing the DataFrame and just build the output list, but in this post I'm really asking if there's a cleaner, faster way to extract these specific keys from the dicts into new columns and rows, robust to whether the record is an array or not and keeping track of the associated record id.
I shied away from an end-to-end pandas approach (reading the data with pd.read_sql()) because I had read that DataFrame.to_sql() is relatively slow.
You could do something like this:
import pandas as pd
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
(2, {'a': 30, 'b': 40}),
(3, {'a': 100, 'b': 200, 'd': 300})]
def split_dict(d, keys=('a', 'b')):
    """Split the dictionary by keys"""
    preserved = {key: value for key, value in d.items() if key in keys}
    complement = {key: value for key, value in d.items() if key not in keys}
    return preserved, complement

def get_row(val):
    preserved, complement = split_dict(val)
    preserved['data'] = complement
    return preserved

rows = []
index = []
for i, values in testcase:
    if isinstance(values, list):
        for value in values:
            rows.append(get_row(value))
            index.append(i)
    else:
        rows.append(get_row(values))
        index.append(i)
df = pd.DataFrame.from_records(rows, index=index)
print(df)
Output
a b data
1 1 2 {'c': 3}
1 11 12 {'c': 13}
2 30 40 {}
3 100 200 {'d': 300}
Your data is messy, since the second element of your testcase can be either a list or a dict. In this case, you can construct a list via a for loop, then feed to the pd.DataFrame constructor:
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
(2, {'a': 30, 'b': 40}),
(3, {'a': 100, 'b': 200, 'd': 300})]
import pandas as pd

L = []
for idx, data in testcase:
    # wrap a bare dict in a list so both shapes go through the same loop
    for d in ([data] if isinstance(data, dict) else data):
        # string conversion not strictly necessary below
        others = str({k: v for k, v in d.items() if k not in ('a', 'b')})
        L.append((idx, d['a'], d['b'], others))
df = pd.DataFrame(L, columns=['index', 'a', 'b', 'data']).set_index('index')
print(df)
a b data
index
1 1 2 {'c': 3}
1 11 12 {'c': 13}
2 30 40 {}
3 100 200 {'d': 300}
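If the DataFrame is only an intermediate step before psycopg2's execute_values(), one possible sketch is to build the output tuples directly. The function name iter_rows and its parameters are hypothetical, and json.dumps stands in here for psycopg2's Json wrapper so that the example runs without a database driver. Unlike the pop-based approach in the question, this version does not mutate the input dicts.

```python
import json

testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
            (2, {'a': 30, 'b': 40}),
            (3, {'a': 100, 'b': 200, 'd': 300})]


def iter_rows(records, keys=('a', 'b')):
    """Yield (id, extracted values..., JSON string of the remaining keys).

    Hypothetical helper: `records` is an iterable of
    (id, dict-or-list-of-dicts) pairs.
    """
    for record_id, payload in records:
        # Wrap a bare dict in a list so both shapes are handled the same way.
        elements = payload if isinstance(payload, list) else [payload]
        for element in elements:
            # Missing keys become None, mirroring pop(key, None) in the question.
            extracted = tuple(element.get(key) for key in keys)
            rest = {k: v for k, v in element.items() if k not in keys}
            yield (record_id, *extracted, json.dumps(rest))


rows = list(iter_rows(testcase))
for row in rows:
    print(row)
```

Because iter_rows is a generator, the rows could also be consumed lazily rather than materialised with list(), which may matter for multi-million-row inputs.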
Part of the program I am developing has a 2D dict of length n.
Dictionary Example:
test_dict = {
0: {'A': 2, 'B': 1, 'C': 5},
1: {'A': 3, 'B': 1, 'C': 2},
2: {'A': 1, 'B': 1, 'C': 1},
3: {'A': 4, 'B': 2, 'C': 5}
}
All of the inner dictionaries have the same keys but different values. I need to sum the values for each key to get the outcome shown below.
I have tried to merge the dictionaries using the following:
new_dict = {}
for k, v in test_dict.items():
    new_dict.setdefault(k, []).append(v)
I also tried using:
new_dict = {**test_dict[0], **test_dict[1], **test_dict[2], **test_dict[3]}
Unfortunately, I have not had any luck getting the desired outcome.
Desired Outcome: outcome = {'A': 10, 'B': 5, 'C': 13}
How can I add all the values into a single dictionary?
Solution using pandas
Convert your dict to pandas.DataFrame and then do summation on columns and convert it back to dict.
import pandas as pd
df = pd.DataFrame.from_dict(test_dict, orient='index')
print(df.sum().to_dict())
Output:
{'A': 10, 'B': 5, 'C': 13}
Alternate solution
Use collections.Counter, whose update() method adds the values of matching keys instead of replacing them:
from collections import Counter
d = Counter()
for _, v in test_dict.items():
    d.update(v)
print(d)
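For comparison, the same sum can be sketched with a plain dict and dict.get, which needs no imports and also tolerates inner dicts that lack some keys. The variable name outcome is taken from the question; the rest is illustrative.

```python
test_dict = {
    0: {'A': 2, 'B': 1, 'C': 5},
    1: {'A': 3, 'B': 1, 'C': 2},
    2: {'A': 1, 'B': 1, 'C': 1},
    3: {'A': 4, 'B': 2, 'C': 5},
}

outcome = {}
for inner in test_dict.values():
    for key, value in inner.items():
        # Start from 0 the first time a key is seen, then accumulate.
        outcome[key] = outcome.get(key, 0) + value

print(outcome)
```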
I'm pulling data from a database. Assume I have something like this:
Product Name Quantity
a 3
a 5
b 2
c 7
I want to sum the Quantity by Product Name, so this is what I want:
product = {'a':8, 'b':2, 'c':7 }
Here's what I'm trying to do after fetching the data from the database:
for row in result:
    product[row['product_name']] += row['quantity']
but this will give me: 'a'=5 only, not 8.
Option 1: pandas
This is one way, assuming you begin with a pandas dataframe df. This solution has O(n log n) complexity.
product = df.groupby('Product Name')['Quantity'].sum().to_dict()
# {'a': 8, 'b': 2, 'c': 7}
The idea is you can perform a groupby operation, which produces a series indexed by "Product Name". Then use the to_dict() method to convert to a dictionary.
Option 2: collections.Counter
If you begin with a list or iterator of results, and wish to use a for loop, you can use collections.Counter for O(n) complexity.
from collections import Counter
result = [['a', 3],
['a', 5],
['b', 2],
['c', 7]]
product = Counter()
for row in result:
    product[row[0]] += row[1]
print(product)
# Counter({'a': 8, 'c': 7, 'b': 2})
Option 3: itertools.groupby
You can also use a dictionary comprehension with itertools.groupby. This requires sorting beforehand.
from itertools import groupby
res = {i: sum(list(zip(*j))[1]) for i, j in groupby(sorted(result), key=lambda x: x[0])}
# {'a': 8, 'b': 2, 'c': 7}
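To illustrate why the sort is needed, here is a small sketch: itertools.groupby only merges adjacent items with equal keys, so unsorted input splits one product into several groups. The row order below is deliberately shuffled for the demonstration, and the names unsorted_keys, ordered and totals are illustrative.

```python
from itertools import groupby
from operator import itemgetter

# Same rows as above, but with the two 'a' rows deliberately separated.
result = [['a', 3], ['b', 2], ['a', 5], ['c', 7]]

# Without sorting, 'a' shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(result, key=itemgetter(0))]

# Sorting by the grouping key first makes equal keys adjacent.
ordered = sorted(result, key=itemgetter(0))
totals = {key: sum(qty for _, qty in group)
          for key, group in groupby(ordered, key=itemgetter(0))}

print(unsorted_keys)
print(totals)
```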
If you insist on using loops, you can do this:
# fake data to make the script runnable
result = [
{'product_name': 'a', 'quantity': 3},
{'product_name': 'a', 'quantity': 5},
{'product_name': 'b', 'quantity': 2},
{'product_name': 'c', 'quantity': 7}
]
# solution with defaultdict and loops
from collections import defaultdict
d = defaultdict(int)
for row in result:
    d[row['product_name']] += row['quantity']
print(dict(d))
The output:
{'a': 8, 'b': 2, 'c': 7}
Since you mention pandas:
df.set_index('ProductName').Quantity.groupby(level=0).sum().to_dict()
Out[20]: {'a': 8, 'b': 2, 'c': 7}
Use tuples to store the rows.
Edit:
It is not clear whether the data mentioned is really a DataFrame.
If it is, then li = [tuple(x) for x in df.to_records(index=False)]
li = [('a', 3), ('a', 5), ('b', 2), ('c', 7)]
d = dict()
for key, val in li:
    val_old = 0
    if key in d:
        val_old = d[key]
    d[key] = val + val_old
print(d)
Output
{'a': 8, 'b': 2, 'c': 7}
I can print variables in Python:
for h in jl1["results"]["attributes-list"]["volume-attributes"]:
    state = str(h["volume-state-attributes"]["state"])
    if aggr in h["volume-id-attributes"]["containing-aggregate-name"]:
        if state == "online":
            print(h["volume-id-attributes"]["owning-vserver-name"]),
            print(' '),
            print(h["volume-id-attributes"]["name"]),
            print(' '),
            print(h["volume-id-attributes"]["containing-aggregate-name"]),
            print(' '),
            print(h["volume-space-attributes"]["size-used"])
These print calls produce, for example, 100 lines. Now I want to print only the top 5 entries ranked by "size-used".
I am trying to collect these values in a dictionary and pick out the top five by "size-used", but I am not sure how to get them into a dictionary.
Something like this:
{'vserver': (u'rcdn9-c01-sm-prod',), 'usize': u'389120', 'vname': (u'nprd_root_m01',), 'aggr': (u'aggr1_n01',)}
Any other options, such as namedtuples, are also appreciated.
Thanks
To get a list of dictionaries sorted by a certain key, use sorted. Say I have a list of dictionaries with a and b keys and want to sort them by the value of the b element:
my_dict_list = [{'a': 3, 'b': 1}, {'a': 1, 'b': 4}, {'a': 4, 'b': 4},
{'a': 2, 'b': 7}, {'a': 2, 'b': 4.3}, {'a': 2, 'b': 9}, ]
my_sorted_dict_list = sorted(my_dict_list, key=lambda element: element['b'], reverse=True)
# Reverse is set to True because by default it sorts from smallest to biggest; we want to reverse that
# Limit to five results
biggest_five_dicts = my_sorted_dict_list[:5]
print(biggest_five_dicts) # [{'a': 2, 'b': 9}, {'a': 2, 'b': 7}, {'a': 2, 'b': 4.3}, {'a': 1, 'b': 4}, {'a': 4, 'b': 4}]
heapq.nlargest is the obvious way to go here:
import heapq
interesting_dicts = ... filter to keep only the dicts you care about (e.g. online dicts) ...
# "size-used" appears as a string in the question's sample, so compare it as an int
for large in heapq.nlargest(5, interesting_dicts,
                            key=lambda d: int(d["volume-space-attributes"]["size-used"])):
    print(...)
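Since the question also asks about namedtuples, here is a minimal runnable sketch of the same idea. The Volume type, its field names (borrowed from the dict sketched in the question) and all sample values are made up for illustration; the sizes are converted with int() on the assumption that they arrive as strings, as the sample value u'389120' suggests.

```python
import heapq
from collections import namedtuple

# Hypothetical record type; field names follow the dict sketched in the question.
Volume = namedtuple("Volume", ["vserver", "vname", "aggr", "usize"])

# Made-up sample records standing in for the filtered API response.
volumes = [
    Volume("vs1", "vol_a", "aggr1", "389120"),
    Volume("vs1", "vol_b", "aggr1", "1024"),
    Volume("vs2", "vol_c", "aggr1", "98304"),
    Volume("vs2", "vol_d", "aggr1", "2048000"),
    Volume("vs3", "vol_e", "aggr1", "512"),
    Volume("vs3", "vol_f", "aggr1", "40960"),
    Volume("vs3", "vol_g", "aggr1", "777"),
]

# Compare sizes numerically; as strings, "98304" would sort above "2048000".
top_five = heapq.nlargest(5, volumes, key=lambda v: int(v.usize))

for vol in top_five:
    print(vol.vserver, vol.vname, vol.aggr, vol.usize)
```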
How can I modify a list value inside a DataFrame? I am trying to adjust data received as JSON, and the DataFrame is shown below.
Each row of the 'options' column holds a list of several dictionaries.
Dataframe df:
id options
0 0 [{'a':1 ,'b':2, 'c':3, 'd':4},{'a':5 ,'b':6, 'c':7, 'd':8}]
1 1 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':13 ,'b':14, 'c':15, 'd':16}]
2 2 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':17 ,'b':18, 'c':19, 'd':20}]
If I want to keep only the 'a' and 'c' keys/values in options, how can I modify the DataFrame? The expected result would be:
Dataframe df:
id options
0 0 [{'a':1 ,'c':3},{'a':5 ,'c':7}]
1 1 [{'a':9, 'c':11},{'a':13,'c':15}]
2 2 [{'a':9 ,'c':11},{'a':17,'c':19}]
I tried filtering, but I could not assign the values back to the DataFrame:
for x in totaldf['options']:
    for y in x:
        y = {'a': y['a'], 'c': y['c']}  # ...?
Using a nested list comprehension:
df['options'] = [[{'a': y['a'], 'c': y['c']} for y in x] for x in df['options']]
If you wanted to use a for loop, it would be something like:
new_options = []
for x in df['options']:
    row = []
    for y in x:
        row.append({'a': y['a'], 'c': y['c']})
    new_options.append(row)
df['options'] = new_options
# An alternative using apply (a row-wise Python loop, not a truly vectorized operation).
df.options = df.options.apply(lambda x: [{k: v for k, v in e.items() if k in ['a', 'c']} for e in x])
Out[398]:
id options
0 0 [{'a': 1, 'c': 3}, {'a': 5, 'c': 7}]
1 1 [{'a': 9, 'c': 11}, {'a': 13, 'c': 15}]
2 2 [{'a': 9, 'c': 11}, {'a': 17, 'c': 19}]
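The filtering step itself can be sketched without pandas, which makes it easy to check in isolation. The helper name keep_keys and the sample data (including a dict that lacks 'c') are hypothetical; the point is that keys missing from a dict are skipped rather than raising KeyError.

```python
def keep_keys(dicts, keys=('a', 'c')):
    """Return copies of `dicts` restricted to `keys`; absent keys are skipped."""
    return [{k: d[k] for k in keys if k in d} for d in dicts]


# Made-up rows shaped like the 'options' column in the question.
options = [
    [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 5, 'b': 6, 'c': 7, 'd': 8}],
    [{'a': 9, 'b': 10, 'c': 11, 'd': 12}, {'a': 13, 'b': 14}],  # no 'c' here
]

filtered = [keep_keys(row) for row in options]
print(filtered)
```

With a DataFrame, the same helper could be applied per row with df['options'] = df['options'].apply(keep_keys).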
Assume there are two Python lists with the same structure, like this:
var1 = [{'a':1,'b':2},{'c':2,'d':5,'h':4},{'c':2,'d':5,'e':4}]
var2 = [{'a':3,'b':2},{'c':1,'d':5,'h':4},{'c':5,'d':5,'e':4}]
In my case, I need to combine both lists by adding the values of matching keys, so that I get this value:
result = [{'a':4,'b':4},{'c':3,'d':10,'h':8},{'c':7,'d':10,'e':8}]
How can I do that?
zip-based one-liner comprehension:
result = [{k: d1[k]+d2[k] for k in d1} for d1, d2 in zip(var1, var2)]
This assumes that two dicts at the same index always have identical key sets.
Use list comprehensions to put the code in one statement:
result = [{key: d1.get(key, 0) + d2.get(key, 0)
           for key in set(d1.keys()) | set(d2.keys())}  # union of the two key sets
          for d1, d2 in zip(var1, var2)]
print(result)
[{'a': 4, 'b': 4}, {'h': 8, 'c': 3, 'd': 10}, {'c': 7, 'e': 8, 'd': 10}]
This code takes into consideration the case that two dictionaries may not have the same keys.
var1 = [{'a':1,'b':2},{'c':2,'d':5,'h':4},{'c':2,'d':5,'e':4}]
var2 = [{'a':3,'b':2},{'c':1,'d':5,'h':4},{'c':5,'d':5,'e':4}]
res = []
for i in range(len(var1)):
    dic = {}
    dic1, dic2 = var1[i], var2[i]
    for key, val in dic1.items():  # dic1.iteritems() in Python 2
        dic[key] = dic1[key] + dic2[key]
    res.append(dic)
>>> print(res)
[{'a': 4, 'b': 4}, {'c': 3, 'd': 10, 'h': 8}, {'c': 7, 'd': 10, 'e': 8}]
var1 = [{'a': 1, 'b': 2}, {'c': 2, 'd': 5, 'h': 4}, {'c': 2, 'd': 5, 'e': 4}]
var2 = [{'a': 3, 'b': 2}, {'c': 1, 'd': 5, 'h': 4}, {'c': 5, 'd': 5, 'e': 4}]
ret = []
for i, ele in enumerate(var1):
    d = {}
    for k, v in ele.items():
        value = v
        value += var2[i][k]
        d[k] = value
    ret.append(d)
print(ret)
For the sake of completeness, another zip-based one-liner that works even if the dicts at the same position in the two lists have different keys:
result = [{k: d1.get(k, 0) + d2.get(k, 0) for k in set(d1) | set(d2)} for d1, d2 in zip(var1, var2)]
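Another option worth sketching is collections.Counter, whose + operator adds values key by key and handles differing key sets. One caveat, shown at the end of the example: Counter addition drops any key whose sum is zero or negative, so this sketch only suits data where all totals stay positive. The name caveat is illustrative.

```python
from collections import Counter

var1 = [{'a': 1, 'b': 2}, {'c': 2, 'd': 5, 'h': 4}, {'c': 2, 'd': 5, 'e': 4}]
var2 = [{'a': 3, 'b': 2}, {'c': 1, 'd': 5, 'h': 4}, {'c': 5, 'd': 5, 'e': 4}]

# Counter + Counter sums the values of matching keys.
result = [dict(Counter(d1) + Counter(d2)) for d1, d2 in zip(var1, var2)]
print(result)

# Caveat: keys whose sum is not positive are silently dropped.
caveat = dict(Counter({'a': 1}) + Counter({'a': -1}))
print(caveat)
```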
Would something like this help?
var1 = [{'a':1,'b':2},{'c':2,'d':5,'h':4},{'c':2,'d':5,'e':4}]
var2 = [{'a':3,'b':2},{'c':1,'d':5,'h':4},{'c':5,'d':5,'e':4}]
combined_var = zip(var1, var2)
list_new_ds = []
for i, j in combined_var:
    new_d = {}
    # iterate over the keys the two dicts have in common
    for key in i.keys() & j.keys():
        new_d[key] = i[key] + j[key]
    list_new_ds.append(new_d)
# list_new_ds == [{'a': 4, 'b': 4}, {'h': 8, 'c': 3, 'd': 10}, {'c': 7, 'e': 8, 'd': 10}]
To explain, zip pairs up the two lists element by element, yielding tuples. I unpack each tuple, iterate through the keys of the dictionaries, and add the values for matching keys together, storing them in a new temporary dictionary. I then append that dictionary to a list and re-initialise the temporary dictionary to empty before looking at the next tuple.
The key order in the output differs from the other answers, which I believe is due to dictionary ordering behaviour.
I am a novice, so would appreciate any critiques of my answer!