I'm pulling data from a database. Assume I have something like this:
Product Name    Quantity
a               3
a               5
b               2
c               7
I want to sum the Quantity per Product Name, so this is what I want:
product = {'a':8, 'b':2, 'c':7 }
Here's what I'm trying to do after fetching the data from the database:
for row in result:
    product[row['product_name']] += row['quantity']
but this will give me: 'a'=5 only, not 8.
Option 1: pandas
This is one way, assuming you begin with a pandas dataframe df. This solution has O(n log n) complexity.
product = df.groupby('Product Name')['Quantity'].sum().to_dict()
# {'a': 8, 'b': 2, 'c': 7}
The idea is you can perform a groupby operation, which produces a series indexed by "Product Name". Then use the to_dict() method to convert to a dictionary.
Option 2: collections.Counter
If you begin with a list or iterator of results, and wish to use a for loop, you can use collections.Counter for O(n) complexity.
from collections import Counter
result = [['a', 3],
          ['a', 5],
          ['b', 2],
          ['c', 7]]
product = Counter()
for row in result:
    product[row[0]] += row[1]
print(product)
# Counter({'a': 8, 'c': 7, 'b': 2})
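The printed Counter above is ordered by count rather than by insertion, because the Counter repr lists the most common keys first. A minimal sketch, reusing the same sample rows, of getting a plain dict or a ranked list back (the names `plain` and `ranked` are only illustrative):

```python
from collections import Counter

# Same sample rows as above: [product_name, quantity]
result = [['a', 3], ['a', 5], ['b', 2], ['c', 7]]

product = Counter()
for name, quantity in result:
    product[name] += quantity  # missing keys start at 0

plain = dict(product)           # ordinary dict, insertion order kept
ranked = product.most_common()  # list of (key, total), largest total first

print(plain)   # {'a': 8, 'b': 2, 'c': 7}
print(ranked)  # [('a', 8), ('c', 7), ('b', 2)]
```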
Option 3: itertools.groupby
You can also use a dictionary comprehension with itertools.groupby. This requires sorting beforehand.
from itertools import groupby
res = {i: sum(list(zip(*j))[1]) for i, j in groupby(sorted(result), key=lambda x: x[0])}
# {'a': 8, 'b': 2, 'c': 7}
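The sorting step matters because itertools.groupby only groups consecutive items with equal keys. A small sketch of what happens without it, using a reordered copy of the sample rows (the helper name `totals` is hypothetical):

```python
from itertools import groupby

# The two 'a' rows are deliberately not adjacent
rows = [['a', 3], ['b', 2], ['a', 5], ['c', 7]]

def totals(pairs):
    """Sum quantities per key; groupby sees each consecutive run as a group."""
    return {key: sum(qty for _, qty in group)
            for key, group in groupby(pairs, key=lambda row: row[0])}

unsorted_totals = totals(rows)        # the second 'a' run overwrites the first
sorted_totals = totals(sorted(rows))  # sorting makes equal keys adjacent

print(unsorted_totals)  # {'a': 5, 'b': 2, 'c': 7}
print(sorted_totals)    # {'a': 8, 'b': 2, 'c': 7}
```

Without sorting, the result shows the same symptom as in the question: 'a' ends up as 5 instead of 8.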
If you insist on using loops, you can do this:
# fake data to make the script runnable
result = [
    {'product_name': 'a', 'quantity': 3},
    {'product_name': 'a', 'quantity': 5},
    {'product_name': 'b', 'quantity': 2},
    {'product_name': 'c', 'quantity': 7}
]
# solution with defaultdict and loops
from collections import defaultdict
d = defaultdict(int)
for row in result:
    d[row['product_name']] += row['quantity']
print(dict(d))
The output:
{'a': 8, 'b': 2, 'c': 7}
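Since you mention pandas:
# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.set_index('ProductName').Quantity.groupby(level=0).sum().to_dict()
Out[20]: {'a': 8, 'b': 2, 'c': 7}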
Use tuples to store the rows.
Edit:
It is not clear whether the data is really a dataframe. If it is, convert it first:
li = [tuple(x) for x in df.to_records(index=False)]
Otherwise, start from a list of tuples:
li = [('a', 3), ('a', 5), ('b', 2), ('c', 7)]
d = dict()
for key, val in li:
    val_old = 0
    if key in d:
        val_old = d[key]
    d[key] = val + val_old
print(d)
Output
{'a': 8, 'b': 2, 'c': 7}
Related
Part of the program I am developing has a 2D dict of length n.
Dictionary Example:
test_dict = {
    0: {'A': 2, 'B': 1, 'C': 5},
    1: {'A': 3, 'B': 1, 'C': 2},
    2: {'A': 1, 'B': 1, 'C': 1},
    3: {'A': 4, 'B': 2, 'C': 5}
}
All of the inner dictionaries have the same keys but different values. I need to sum the values for each key to get the outcome shown below.
I have tried to merge the dictionaries using the following:
new_dict = {}
for k, v in test_dict.items():
    new_dict.setdefault(k, []).append(v)
I also tried using:
new_dict = {**test_dict[0], **test_dict[1], **test_dict[2], **test_dict[3]}
Unfortunately, I have not had any luck getting the desired outcome.
Desired Outcome: outcome = {'A': 10, 'B': 5, 'C': 13}
How can I add all the values into a single dictionary?
Solution using pandas
Convert your dict to pandas.DataFrame and then do summation on columns and convert it back to dict.
import pandas as pd
df = pd.DataFrame.from_dict(test_dict, orient='index')
print(df.sum().to_dict())
Output:
{'A': 10, 'B': 5, 'C': 13}
Alternate solution
Use collections.Counter which allows you to add the values of same keys within dict
from collections import Counter
d = Counter()
for _, v in test_dict.items():
    d.update(v)
print(d)
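Counter.update adds the incoming values to the existing ones instead of replacing them, which is why the loop above produces sums. A self-contained sketch of that behaviour, plus one caveat worth knowing: adding Counters with `+` discards keys whose total is not positive, while update keeps them. The names `totals`, `outcome` and `dropped` are illustrative only.

```python
from collections import Counter

test_dict = {
    0: {'A': 2, 'B': 1, 'C': 5},
    1: {'A': 3, 'B': 1, 'C': 2},
    2: {'A': 1, 'B': 1, 'C': 1},
    3: {'A': 4, 'B': 2, 'C': 5},
}

totals = Counter()
for inner in test_dict.values():
    totals.update(inner)  # adds values key by key; zero or negative totals are kept

outcome = dict(totals)
print(outcome)  # {'A': 10, 'B': 5, 'C': 13}

# Counter addition, by contrast, drops keys whose total is not positive
dropped = Counter({'A': 1, 'B': -1}) + Counter({'B': 1})
print(dict(dropped))  # {'A': 1}
```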
Goal: to read data from SQL table where a column contains JSON (arrays), extract certain keys/values from the JSON into new columns to then write to a new table. One of the joys of the original data format is that some data records are JSON arrays and some are not arrays (just JSON). Thus we may start with:
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
            (2, {'a': 30, 'b': 40}),
            (3, {'a': 100, 'b': 200, 'd': 300})]
for x in testcase:
    print(x)
(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}])
(2, {'a': 30, 'b': 40})
(3, {'a': 100, 'b': 200, 'd': 300})
Note the first element of each tuple is the record id. The first record is an array of length two, the second and third records are not arrays. The desired output is (as a DataFrame):
     a    b  data
1    1    2  '{"c": 3}'
1   11   12  '{"c": 13}'
2   30   40  '{}'
3  100  200  '{"d": 300}'
Here you can see I've extracted keys 'a' and 'b' from the dicts into new columns, leaving the remaining keys/values in situ. The empty dict for id=2 is desirable behaviour.
First, I extracted the id and the data into separate lists. I take this opportunity to make the dict into a list of dicts (of length 1) so the types are now consistent:
id = [x[0] for x in testcase]
data_col = [x[1] if type(x[1]) == list else [x[1]] for x in testcase]
for x in data_col:
    print(x)
[{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]
[{'a': 30, 'b': 40}]
[{'a': 100, 'b': 200, 'd': 300}]
It feels a bit of a clunky extra step to have to extract id and data_col as separate lists, although at least we have the nice property that we're not copying data:
id[0] is testcase[0][0]
True
data_col[0] is testcase[0][1]
True
And, as I say, I had to deal with the issue that some records contained arrays of dicts and some just dicts, so this makes them all consistent.
The main nitty gritty happens here, where I perform a dict comprehension in a double list comprehension to iterate over each dict:
popped = [(id, {key: element.pop(key, None) for key in ['a', 'b']})
          for id, row in zip(id, data_col) for element in row]
for x in popped:
    print(x)
(1, {'a': 1, 'b': 2})
(1, {'a': 11, 'b': 12})
(2, {'a': 30, 'b': 40})
(3, {'a': 100, 'b': 200})
I need to be able to relate each new row with its original id, and the above achieves that, correctly reproducing the appropriate id value (1, 1, 2, 3). With a bit of housekeeping, I can then get all my target rows lined up:
import pandas as pd
from psycopg2.extras import Json
id2 = [x[0] for x in popped]
cols = [x[1] for x in popped]
data = [Json(item) for sublist in data_col for item in sublist]
popped_df = pd.DataFrame(cols, index=id2)
popped_df['data'] = data
And this gives me the desired DataFrame as shown above. But ... is all my messing about with lists necessary? I couldn't do a simple json_normalize because I don't want to extract all keys and it falls over with the combination of arrays and non-arrays.
It also needs to be as performant as possible as it's going to be processing multi-millions of rows. For this reason, I actually convert the DataFrame to a list using:
list(popped_df.itertuples())
to then pass to psycopg2.extras' execute_values()
so I may yet not bother constructing the DataFrame and just build the output list, but in this post I'm really asking if there's a cleaner, faster way to extract these specific keys from the dicts into new columns and rows, robust to whether the record is an array or not and keeping track of the associated record id.
I shied away from an end-to-end pandas approach, reading the data using pd.read_sql() as I was reading that DataFrame.to_sql() was relatively slow.
You could do something like this:
import pandas as pd
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
            (2, {'a': 30, 'b': 40}),
            (3, {'a': 100, 'b': 200, 'd': 300})]
def split_dict(d, keys=('a', 'b')):
    """Split the dictionary by keys"""
    preserved = {key: value for key, value in d.items() if key in keys}
    complement = {key: value for key, value in d.items() if key not in keys}
    return preserved, complement
def get_row(val):
    preserved, complement = split_dict(val)
    preserved['data'] = complement
    return preserved
rows = []
index = []
for i, values in testcase:
    if isinstance(values, list):
        for value in values:
            rows.append(get_row(value))
            index.append(i)
    else:
        rows.append(get_row(values))
        index.append(i)
df = pd.DataFrame.from_records(rows, index=index)
print(df)
Output
     a    b        data
1    1    2    {'c': 3}
1   11   12   {'c': 13}
2   30   40          {}
3  100  200  {'d': 300}
Your data is messy, since the second element of your testcase can be either a list or a dict. In this case, you can construct a list via a for loop, then feed to the pd.DataFrame constructor:
import pandas as pd
testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
            (2, {'a': 30, 'b': 40}),
            (3, {'a': 100, 'b': 200, 'd': 300})]
L = []
for idx, data in testcase:
    for d in ([data] if isinstance(data, dict) else data):
        # string conversion not strictly necessary below
        others = str({k: v for k, v in d.items() if k not in ('a', 'b')})
        L.append((idx, d['a'], d['b'], others))
df = pd.DataFrame(L, columns=['index', 'a', 'b', 'data']).set_index('index')
print(df)
         a    b        data
index
1        1    2    {'c': 3}
1       11   12   {'c': 13}
2       30   40          {}
3      100  200  {'d': 300}
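Since the question mentions that the DataFrame is converted back to a list before writing, here is a rough sketch of building the output tuples directly with the standard library only. It assumes the leftover keys can be serialised with json.dumps (the question uses psycopg2's Json wrapper instead), and the function name `iter_rows` is hypothetical. Unlike the pop-based approach, it leaves the input dicts unmodified.

```python
import json

testcase = [(1, [{'a': 1, 'b': 2, 'c': 3}, {'a': 11, 'b': 12, 'c': 13}]),
            (2, {'a': 30, 'b': 40}),
            (3, {'a': 100, 'b': 200, 'd': 300})]

def iter_rows(records, keys=('a', 'b')):
    """Yield (id, a, b, json_of_remaining_keys) for every dict in every record."""
    for record_id, payload in records:
        # Wrap a bare dict so arrays and non-arrays share one code path
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            extracted = tuple(item.get(key) for key in keys)  # None if a key is missing
            rest = {k: v for k, v in item.items() if k not in keys}
            yield (record_id, *extracted, json.dumps(rest))

rows = list(iter_rows(testcase))
for row in rows:
    print(row)
# (1, 1, 2, '{"c": 3}')
# (1, 11, 12, '{"c": 13}')
# (2, 30, 40, '{}')
# (3, 100, 200, '{"d": 300}')
```

Because `iter_rows` is a generator, rows could also be consumed lazily instead of being collected into a list first; whether that helps at the multi-million-row scale would need measuring.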
I have a pandas dictionary series, that takes the values like
0 {AA:25,BB:31}
1 {CC:45,AA:3}
2 {BB:3,CD:4,AA:5}
I want to create a dictionary out of it based on the key and its occurrence in series, like:
{AA:3,BB:2,CC:1,CD:1}
I doubt there is a "built-in" solution for this, so you'd have to manually iterate and count each key in every dictionary.
import pandas as pd
from collections import defaultdict
ser = pd.Series([{'AA':25,'BB':31},
                 {'CC':45,'AA':3},
                 {'BB':3,'CD':4,'AA':5}])
count = defaultdict(int)
for d in ser:
    for key in d:
        count[key] += 1
print(count)
# defaultdict(<class 'int'>, {'CC': 1, 'BB': 2, 'AA': 3, 'CD': 1})
You could also use Counter, however this looks rather "forced" in this situation:
import pandas as pd
from collections import Counter
total = Counter()
ser = pd.Series([{'AA':25,'BB':31},
                 {'CC':45,'AA':3},
                 {'BB':3,'CD':4,'AA':5}])
for d in ser:
    total.update(d.keys())
print(total)
# Counter({'AA': 3, 'BB': 2, 'CD': 1, 'CC': 1})
Turn your series into a series of lists of keys, sum those to create a single list of keys, and use a Counter:
In [23]: pd.Series([{'AA':25,'BB':31},{'CC':45,'AA':3},{'BB':3,'CD':4,'AA':5}])
Out[23]:
0 {'AA': 25, 'BB': 31}
1 {'AA': 3, 'CC': 45}
2 {'CD': 4, 'AA': 5, 'BB': 3}
dtype: object
In [24]: series = _
In [34]: from collections import Counter
In [35]: Counter(series.apply(lambda x: list(x.keys())).sum())
Out[35]: Counter({'AA': 3, 'BB': 2, 'CC': 1, 'CD': 1})
Or using generator expressions and flattening:
In [37]: Counter(k for d in series for k in d.keys())
Out[37]: Counter({'AA': 3, 'BB': 2, 'CC': 1, 'CD': 1})
# plain-dict equivalent: count each key once per dictionary in the series
counter = dict()
for item in series:
    for key in item:
        counter[key] = counter.get(key, 0) + 1
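The loops above all rely on the fact that iterating over a dict yields its keys. A compact sketch of the same counting idea with itertools.chain, using a plain list of dicts as a stand-in for the pandas Series (an assumption made so the example runs without pandas):

```python
from collections import Counter
from itertools import chain

# Plain list standing in for the pandas Series of dicts
records = [{'AA': 25, 'BB': 31},
           {'CC': 45, 'AA': 3},
           {'BB': 3, 'CD': 4, 'AA': 5}]

# Iterating a dict yields its keys, so chaining the dicts gives one stream of keys
key_counts = Counter(chain.from_iterable(records))
print(dict(key_counts))  # {'AA': 3, 'BB': 2, 'CC': 1, 'CD': 1}
```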
Maybe it's a bit late but this is another way of doing it by using pandas built-in functions.
s = pd.Series([{'AA':25,'BB':31},
               {'CC':45,'AA':3},
               {'BB':3,'CD':4,'AA':5}])
#convert dict to a dataframe and count non nan elements and finally convert it to a dict.
s.apply(pd.Series).count().to_dict()
Out[651]: {'AA': 3, 'BB': 2, 'CC': 1, 'CD': 1}
I have a list of lists of data:
[[1422029700000, 230.84, 230.42, 230.31, 230.32, 378], [1422029800000, 231.84, 231.42, 231.31, 231.32, 379], ...]
and a list of keys:
['a', 'b', 'c', 'd', 'e']
I want to combine them to a dictionary of lists so it looks like:
{'a': [1422029700000, 1422029800000], 'b': [230.84, 231.84], ...}
I can do this using loops but I am looking for a pythonic way.
It is quite simple:
In [1]: keys = ['a','b','c']
In [2]: values = [[1,2,3],[4,5,6],[7,8,9]]
In [7]: dict(zip(keys, zip(*values)))
Out[7]: {'a': (1, 4, 7), 'b': (2, 5, 8), 'c': (3, 6, 9)}
If you need lists as values:
In [8]: dict(zip(keys, [list(t) for t in zip(*values)]))
Out[8]: {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}
or:
In [9]: dict(zip(keys, map(list, zip(*values))))
Out[9]: {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}
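For readers unfamiliar with the `zip(*values)` idiom used above, here is the same thing in two steps. `zip(*values)` unpacks the rows as separate arguments and regroups their elements by position, which transposes rows into columns (the name `columns` is just illustrative):

```python
keys = ['a', 'b', 'c']
values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Step 1: transpose rows into columns
columns = list(zip(*values))
print(columns)  # [(1, 4, 7), (2, 5, 8), (3, 6, 9)]

# Step 2: pair each key with its column
combined = {key: list(column) for key, column in zip(keys, columns)}
print(combined)  # {'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}
```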
Use:
{k: [d[i] for d in data] for i, k in enumerate(keys)}
Example:
>>> data=[[1422029700000, 230.84, 230.42, 230.31, 230.32, 378], [1422029800000, 231.84, 231.42, 231.31, 231.32, 379]]
>>> keys = ["a", "b", "c"]
>>> {k: [d[i] for d in data] for i, k in enumerate(keys)}
{'c': [230.42, 231.42], 'a': [1422029700000, 1422029800000], 'b': [230.84, 231.84]}
Your question has everything in a list so if you want a list of dicts:
l1= [[1422029700000, 230.84, 230.42, 230.31, 230.32, 378], [1422029800000, 231.84, 231.42, 231.31, 231.32, 379]]
l2 = ['a', 'b', 'c', 'd', 'e',"f"] # added f to match length of sublists
print([{a:list(b)} for a,b in zip(l2,zip(*l1))])
[{'a': [1422029700000, 1422029800000]}, {'b': [230.84, 231.84]}, {'c': [230.42, 231.42]}, {'d': [230.31, 231.31]}, {'e': [230.32, 231.32]}, {'f': [378, 379]}]
If you actually want a dict use a dict comprehension with zip:
print({a:list(b) for a,b in zip(l2,zip(*l1))})
{'f': [378, 379], 'e': [230.32, 231.32], 'a': [1422029700000, 1422029800000], 'b': [230.84, 231.84], 'c': [230.42, 231.42], 'd': [230.31, 231.31]}
Your example also has a list of keys shorter than your sublists, so zipping will silently drop values from the sublists. You may want to address that.
If you are using Python 2 you can use itertools.izip:
from itertools import izip
print({a:list(b) for a,b in izip(l2,zip(*l1))})
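The truncation mentioned above is easy to miss because zip stops at the shorter input without any warning. A small Python 3 sketch, using the five keys from the question, that makes the dropped column visible with itertools.zip_longest (the names `truncated` and `padded` are illustrative):

```python
from itertools import zip_longest

rows = [[1422029700000, 230.84, 230.42, 230.31, 230.32, 378],
        [1422029800000, 231.84, 231.42, 231.31, 231.32, 379]]
keys = ['a', 'b', 'c', 'd', 'e']  # one key short of the six columns

columns = list(zip(*rows))

# zip stops at the shorter input, so the sixth column is silently dropped
truncated = {k: list(col) for k, col in zip(keys, columns)}
print(len(columns), len(truncated))  # 6 5

# zip_longest pads the shorter input with None, exposing the unmatched column
padded = {k: list(col) for k, col in zip_longest(keys, columns)}
print(padded[None])  # [378, 379]
```

Comparing `len(keys)` with `len(columns)` before zipping is an even simpler guard if a mismatch should be treated as an error.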
I have found many threads for sorting by values like here but it doesn't seem to be working for me...
I have a dictionary of lists of tuples. Each list has a different number of tuples. I want to sort the dictionary by how many tuples each list contains.
>>> to_format
{"one":[(1,3),(1,4)],"two":[(1,2),(1,2),(1,3)],"three":[(1,1)]}
>>> for key in some_sort(to_format):
...     print key,
two one three
Is this possible?
>>> d = {"one": [(1,3),(1,4)], "two": [(1,2),(1,2),(1,3)], "three": [(1,1)]}
>>> for k in sorted(d, key=lambda k: len(d[k]), reverse=True):
...     print k,
...
two one three
Here is a universal solution that works on Python 2 & Python 3:
>>> print(' '.join(sorted(d, key=lambda k: len(d[k]), reverse=True)))
two one three
import operator
# named 'data' rather than 'dict' to avoid shadowing the built-in
data = {'a': [9,2,3,4,5], 'b': [1,2,3,4, 5, 6], 'c': [], 'd': [1,2,3,4], 'e': [1,2]}
dict_temp = {'a': 'hello', 'b': 'bye', 'c': '', 'd': 'aa', 'e': 'zz'}
def sort_by_values_len(d):
    # map each key to the length of its value
    dict_len = {key: len(value) for key, value in d.items()}
    # sort (key, length) pairs by length, longest first
    sorted_key_list = sorted(dict_len.items(), key=operator.itemgetter(1), reverse=True)
    # rebuild as a list of single-key dicts in sorted order
    sorted_dict = [{item[0]: d[item[0]]} for item in sorted_key_list]
    return sorted_dict
print(sort_by_values_len(data))
output:
[{'b': [1, 2, 3, 4, 5, 6]}, {'a': [9, 2, 3, 4, 5]}, {'d': [1, 2, 3, 4]}, {'e': [1, 2]}, {'c': []}]
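If a single dictionary is preferred over a list of one-key dictionaries, a Python 3.7+ sketch can rely on dicts preserving insertion order. This is an alternative to the function above, not a rewrite of it, and the names `data`, `ordered_keys` and `ordered` are illustrative:

```python
data = {'a': [9, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5, 6], 'c': [],
        'd': [1, 2, 3, 4], 'e': [1, 2]}

# Sort the keys by the length of their values, longest first
ordered_keys = sorted(data, key=lambda k: len(data[k]), reverse=True)

# Dicts keep insertion order (guaranteed since Python 3.7)
ordered = {k: data[k] for k in ordered_keys}
print(list(ordered))  # ['b', 'a', 'd', 'e', 'c']
```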