python efficient group by

python efficient group by - python

I am looking for the most efficient way to extract items from a list of dictionaries.I have a list of about 5k dictionaries. I need to extract those records/items for which grouping by a particular field gives more than a threshold T number of records. For example, if T = 2 and dictionary key 'id':
list = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}, {'name': 'bbc', 'id' : 2}]
The result should be:
list = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}]
i.e. All the records with some id such that there are atleast 3 records of same id.

l = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}, {'name': 'bbc', 'id' : 2}]
from collections import defaultdict
from itertools import chain
d = defaultdict(list)
T = 2
for dct in l:
d[dct["id"]].append(dct)
print(list(chain.from_iterable(v for v in d.values() if len(v) > T)))
[{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1}, {'name': 'c', 'id': 1}]
If you want to keep them in groups don't chain just use each value:
[v for v in d.values() if len(v) > T] # itervalues for python2
[[{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1}, {'name': 'c', 'id': 1}]]
Avoid using list as a variable as it shadows the python list type and if you had a variable list then the code above would cause you a few problems in relation to d = defaultdict(list)

to start out I would make a dictionary to group by your id
control = {}
for d in list:
control.setdefault(d['id'],[]).append(d)
from here all you have to do is check the length of control to see if its greater than your specified threshold
put it in a function like so
def find_by_id(obj, threshold):
control = {}
for d in obj:
control.setdefault(d['id'], []).append(d)
for val in control.values():
if len(val) > threshold:
print val

Related

maintain dictionary structure while reducing nested dictionary

I have a list of pairs of nested dict dd and would like to maintain the structure to a list of dictionaries:
dd = [
[{'id': 'bla',
'detail': [{'name': 'discard', 'amount': '123'},
{'name': 'KEEP_PAIR_1A', 'amount': '2'}]},
{'id': 'bla2',
'detail': [{'name': 'discard', 'amount': '123'},
{'name': 'KEEP_PAIR_1B', 'amount': '1'}]}
],
[{'id': 'bla3',
'detail': [{'name': 'discard', 'amount': '123'},
{'name': 'KEEP_PAIR_2A', 'amount': '3'}]},
{'id': 'bla4',
'detail': [{'name': 'discard', 'amount': '123'},
{'name': 'KEEP_PAIR_2B', 'amount': '4'}]}
]
]
I want to reduce this to a list of paired dictionaries while extracting only some detail. For example, an expected output may look like this:
[{'name': ['KEEP_PAIR_1A', 'KEEP_PAIR_1B'], 'amount': [2, 1]},
{'name': ['KEEP_PAIR_2A', 'KEEP_PAIR_2B'], 'amount': [3, 4]}]
I have run my code:
pair=[]
for all_pairs in dd:
for output_pairs in all_pairs:
for d in output_pairs.get('detail'):
if d['name'] != 'discard':
pair.append(d)
output_pair = {
k: [d.get(k) for d in pair]
for k in set().union(*pair)
}
But it didn't maintain that structure :
{'name': ['KEEP_PAIR_1A', 'KEEP_PAIR_1B', 'KEEP_PAIR_2A', 'KEEP_PAIR_2B'],
'amount': ['2', '1', '3', '4']}
I assume I would need to use some list comprehension to solve this but where in the for loop should I do that to maintain the structure.

Since you want to combine dictionaries in lists, one option is to use dict.setdefault:
pair = []
for all_pairs in dd:
dct = {}
for output_pairs in all_pairs:
for d in output_pairs.get('detail'):
if d['name'] != 'discard':
for k,v in d.items():
dct.setdefault(k, []).append(v)
pair.append(dct)
Output:
[{'name': ['KEEP_PAIR_1A', 'KEEP_PAIR_1B'], 'amount': [2, 1]},
{'name': ['KEEP_PAIR_2A', 'KEEP_PAIR_2B'], 'amount': [3, 4]}]

Python one liner to merge dictionary which has common values

What I have:
a=[{'name':'a','vals':1,'required':'yes'},{'name':'b','vals':2},{'name':'d','vals':3}]
b=[{'name':'a','type':'car'},{'name':'b','type':'bike'},{'name':'c','type':'van'}]
What I tried:
[[i]+[j] for i in b for j in a if i['name']==j['name']]
What I got:
[[{'name': 'a', 'type': 'car'}, {'name': 'a', 'vals': 1}], [{'name': 'b', 'type': 'bike'}, {'name': 'b', 'vals': 2}]]
What I want:
[{'name': 'a', 'type': 'car','vals': 1},{'name': 'b', 'type': 'bike','vals': 2}]
Note:
I need to merge dicts into one dict.
It should merge only those have common 'name' in both a and b.
I want python one liner answer.

For Python 3, you can do this:
a=[{'name':'a','vals':1},{'name':'b','vals':2},{'name':'d','vals':3}]
b=[{'name':'a','type':'car'},{'name':'b','type':'bike'},{'name':'c','type':'van'}]
print([{**i,**j} for i in b for j in a if i['name']==j['name']])

find cells with specific value and replace its value

Using pandas I have created a csv file containing 2 columns and saved my data into these columns. something like this:
fist second
{'value': 2} {'name': 'f'}
{'value': 2} {'name': 'h'}
{"value": {"data": {"n": 2, "m":"f"}}} {'name': 'h'}
...
Is there any way to look for all the rows whose the first column contains "data" and if any, only keep its value in this cell? I mean is it possible to change my third row from:
{"value": {"data": {"n": 2, "m":"f"}}} {'name': 'h'}
to something like this:
{"data": {"n": 2, "m":"f"}} {'name': 'h'}
and delete or replace the value of all other cells that does not contain data to something like -?
So my csv file will look like this:
fist second
- {'name': 'f'}
- {'name': 'h'}
{"data": {"n": 2, "m":"f"}} {'name': 'h'}
...
Here is my code:
import json
import pandas as pd
result = []
for line in open('file.json', 'r'):
result.append(json.loads(line))
df = pd.DataFrame(result)
print(df)
df.to_csv('document.csv')
f = pd.read_csv("document.csv")
keep_col = ['first', 'second']
new_f = f[keep_col]
new_f.to_csv("newFile.csv", index=False)

here is my short example:
df = pd.DataFrame({
'first' : [{'value': 2}, {'value': 2}, {"value": {"data": {"n": 2, "m":"f"}}}]
,'secound' : [{'name': 'f'}, {'name': 'h'},{'name': 'h'}]
})
a = pd.DataFrame(df["first"].tolist())
a[~a["value"].str.contains("data",regex=False).fillna(False)] = "-"
df["first"] = a.value
first step is to remove the 'value' field. after this the value field
if the new field contains the word "data" the field is set to true; all other fields are False, Numberic Fields have the value NaN this is Replaced with False. And the whole gets negated and replaced with "-"
last step is to overwrite the column in the original Data Frame.

Something like this might work.
first=[{'value': 2} , {'value': 2} , {"value": {"data": {"n": 2, "m":"f"}}}, {"data": {"n": 2, "m":"f"}}]
second=[{'name': 'f'}, {'name': 'h'}, {'name': 'h'}, {'name': 'h'}]
df = pd.DataFrame({'first': first,
'second': second})
f = lambda x: x.get('value', x) if isinstance(x, dict) else np.nan
df['first'] = df['first'].apply(f)
df['first'][~df["first"].str.contains("data",regex=False).fillna(False)] = "-"
print(df)
first second
0 - {'name': 'f'}
1 - {'name': 'h'}
2 {'data': {'n': 2, 'm': 'f'}} {'name': 'h'}
3 {'data': {'n': 2, 'm': 'f'}} {'name': 'h'}

Order by value in dictionary

I'm just practising with python. I have a dictionary in the form:
my_dict = [{'word': 'aa', 'value': 2},
{'word': 'aah', 'value': 6},
{'word': 'aahed', 'value': 9}]
How would I go about ordering this dictionary such that if I had thousands of words I would then be able to select the top 100 based on their value ranking? e.g., from just the above example:
scrabble_rank = [{'word': 'aahed', 'rank': 1},
{'word': 'aah', 'rank': 2},
{'word': 'aa', 'rank': 3}]

Firstly, that's not a dictionary; it's a list of dictionaries. Which is good, because dictionaries are unordered, but lists are ordered.
You can sort the list by the value of the rank element by using it as a key to the sort function:
scrabble_rank.sort(key=lambda x: x['value'])

Is this what you are looking for:
scrabble_rank = [{'word':it[1], 'rank':idx+1} for idx,it in enumerate(sorted([[item['value'],item['word']] for item in my_dict],reverse=True))]

Using Pandas Library:
import pandas as pd
There is this one-liner:
scrabble_rank = pd.DataFrame(my_dict).sort_values('value', ascending=False).reset_index(drop=True).reset_index().to_dict(orient='records')
It outputs:
[{'index': 0, 'value': 9, 'word': 'aahed'},
{'index': 1, 'value': 6, 'word': 'aah'},
{'index': 2, 'value': 2, 'word': 'aa'}]
Basically it reads your records into a DataFrame, then it sort by value in descending order, then it drops original index (order), and it exports as records (your previous format).

You can use heapq:
import heapq
my_dict = [{'word': 'aa', 'value': 2},
{'word': 'aah', 'value': 6},
{'word': 'aahed', 'value': 9}]
# Select the top 3 records based on `value`
values_sorted = heapq.nlargest(3, # fetch top 3
my_dict, # dict to be used
key=lambda x: x['value']) # Key definition
print(values_sorted)
[{'word': 'aahed', 'value': 9}, {'word': 'aah', 'value': 6}, {'word': 'aa', 'value': 2}]

all possible combinations of dicts based on values inside dicts

I want to generate all possible ways of using dicts, based on the values in them. To explain in code, I have:
a = {'name' : 'a', 'items': 3}
b = {'name' : 'b', 'items': 4}
c = {'name' : 'c', 'items': 5}
I want to be able to pick (say) exactly 7 items from these dicts, and all the possible ways I could do it in.
So:
x = itertools.product(range(a['items']), range(b['items']), range(c['items']))
y = itertools.ifilter(lambda i: sum(i)==7, x)
would give me:
(0, 3, 4)
(1, 2, 4)
(1, 3, 3)
...
What I'd really like is:
({'name' : 'a', 'picked': 0}, {'name': 'b', 'picked': 3}, {'name': 'c', 'picked': 4})
({'name' : 'a', 'picked': 1}, {'name': 'b', 'picked': 2}, {'name': 'c', 'picked': 4})
({'name' : 'a', 'picked': 1}, {'name': 'b', 'picked': 3}, {'name': 'c', 'picked': 3})
....
Any ideas on how to do this, cleanly?

Here it is
import itertools
import operator
a = {'name' : 'a', 'items': 3}
b = {'name' : 'b', 'items': 4}
c = {'name' : 'c', 'items': 5}
dcts = [a,b,c]
x = itertools.product(range(a['items']), range(b['items']), range(c['items']))
y = itertools.ifilter(lambda i: sum(i)==7, x)
z = (tuple([[dct, operator.setitem(dct, 'picked', vval)][0] \
for dct,vval in zip(dcts, val)]) for val in y)
for zz in z:
print zz
You can modify it to create copies of dictionaries. If you need a new dict instance on every iteration, you can change z line to
z = (tuple([[dct, operator.setitem(dct, 'picked', vval)][0] \
for dct,vval in zip(map(dict,dcts), val)]) for val in y)

easy way is to generate new dicts:
names = [x['name'] for x in [a,b,c]]
ziped = map(lambda x: zip(names, x), y)
maped = map(lambda el: [{'name': name, 'picked': count} for name, count in el],
ziped)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python efficient group by - python

Related

maintain dictionary structure while reducing nested dictionary

Python one liner to merge dictionary which has common values

find cells with specific value and replace its value

Order by value in dictionary

all possible combinations of dicts based on values inside dicts

Categories

Resources