Pandas groupby function with dict.update() - python

I'm trying to use the Pandas groupby function and apply dict.update() to the elements of each group. An example data frame (just for illustration):
A B
0 icon1 {'ap1': {'item' : 1}, 'ap2': {'item' : 2}}
1 icon1 {'ap3': {'item' : 3}}
What I'm trying to do is set something like
df = df.groupby('A')['B'].apply(', '.join).reset_index()
But instead of using ', '.join, I need to group by the 'A' column and merge the dicts in the 'B' column, the way dict.update() would. I've tried using the map function, but I was not able to achieve anything useful.
The outcome should be:
A B
0 icon1 {'ap1': {'item' : 1}, 'ap2': {'item' : 2}, 'ap3': {'item' : 3}}
Is that even possible without changing the item type from dict?

Using dict comprehension
df.groupby('A').B.agg(lambda s: {k:v for a in s for k, v in a.items()}).reset_index()
A B
0 icon1 {'ap1': {'item' : 1}, 'ap2': {'item' : 2}, 'ap3': {'item' : 3}}

toolz.dicttoolz.merge
from toolz.dicttoolz import merge
df.groupby('A')['B'].agg(merge).reset_index()
A B
0 icon1 {'ap1': {'item': 1}, 'ap2': {'item': 2}, 'ap3'...
1 icon2 {'ap1': {'item': 1}, 'ap2': {'item': 2}, 'ap3'...
Setup
df = pd.DataFrame(dict(
A=['icon1', 'icon1', 'icon2', 'icon2'],
B=[{'ap1': {'item': 1}, 'ap2': {'item': 2}}, {'ap3': {'item': 3}}] * 2
))

You can use a helper function:
def func(x):
    dct = {}
    for i in x:
        dct.update(i)
    return dct
df.groupby('A')['B'].agg(func).reset_index()
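To see what the helper does when keys collide, here is a minimal check on a plain list, outside pandas. The sample dicts are made up for illustration. As with dict.update, a later dict overwrites an earlier one when both share a top-level key, and nested dicts are replaced rather than merged.

```python
def func(x):
    # Same helper as above: fold every dict in x into one new dict.
    dct = {}
    for i in x:
        dct.update(i)
    return dct


# Hypothetical values of column 'B' for a single group.
group = [
    {'ap1': {'item': 1}, 'ap2': {'item': 2}},
    {'ap3': {'item': 3}},
    {'ap1': {'item': 99}},  # collides with the first dict on 'ap1'
]

merged = func(group)
print(merged)
```

The input dicts are left untouched, because the helper only writes into the new dct.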

Related

find cells with specific value and replace its value

Using pandas I have created a csv file containing 2 columns and saved my data into these columns. something like this:
first second
{'value': 2} {'name': 'f'}
{'value': 2} {'name': 'h'}
{"value": {"data": {"n": 2, "m":"f"}}} {'name': 'h'}
...
Is there any way to find all the rows whose first column contains "data" and, for those rows, keep only that value in the cell? I mean, is it possible to change my third row from:
{"value": {"data": {"n": 2, "m":"f"}}} {'name': 'h'}
to something like this:
{"data": {"n": 2, "m":"f"}} {'name': 'h'}
and delete or replace the value of all other cells that do not contain data with something like -?
So my csv file will look like this:
first second
- {'name': 'f'}
- {'name': 'h'}
{"data": {"n": 2, "m":"f"}} {'name': 'h'}
...
Here is my code:
import json
import pandas as pd
result = []
for line in open('file.json', 'r'):
    result.append(json.loads(line))
df = pd.DataFrame(result)
print(df)
df.to_csv('document.csv')
f = pd.read_csv("document.csv")
keep_col = ['first', 'second']
new_f = f[keep_col]
new_f.to_csv("newFile.csv", index=False)
Here is my short example:
df = pd.DataFrame({
    'first': [{'value': 2}, {'value': 2}, {"value": {"data": {"n": 2, "m":"f"}}}],
    'second': [{'name': 'f'}, {'name': 'h'}, {'name': 'h'}]
})
a = pd.DataFrame(df["first"].tolist())
a[~a["value"].str.contains("data",regex=False).fillna(False)] = "-"
df["first"] = a.value
The first step unpacks the 'value' field of each dict into its own column.
str.contains then gives True for every cell that contains "data". Numeric cells give NaN, which fillna replaces with False. The resulting mask is negated, and the selected cells are set to "-".
The last step overwrites the column in the original DataFrame.
Something like this might work.
import numpy as np
first=[{'value': 2} , {'value': 2} , {"value": {"data": {"n": 2, "m":"f"}}}, {"data": {"n": 2, "m":"f"}}]
second=[{'name': 'f'}, {'name': 'h'}, {'name': 'h'}, {'name': 'h'}]
df = pd.DataFrame({'first': first,
                   'second': second})
# unwrap {'value': ...} when present; non-dict cells become NaN
f = lambda x: x.get('value', x) if isinstance(x, dict) else np.nan
df['first'] = df['first'].apply(f)
# use .loc instead of chained indexing so the assignment hits df itself
df.loc[~df["first"].str.contains("data",regex=False).fillna(False), 'first'] = "-"
print(df)
first second
0 - {'name': 'f'}
1 - {'name': 'h'}
2 {'data': {'n': 2, 'm': 'f'}} {'name': 'h'}
3 {'data': {'n': 2, 'm': 'f'}} {'name': 'h'}
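The same rule can also be written as a small per-cell function, which avoids relying on the .str accessor for dict values. This is only a sketch: keep_data is a hypothetical name, and it assumes the cells are real dict objects rather than strings.

```python
def keep_data(cell):
    """Return the inner dict if it has a 'data' key, otherwise '-'."""
    # Unwrap {'value': ...} when present; otherwise inspect the cell itself.
    inner = cell.get('value', cell) if isinstance(cell, dict) else cell
    if isinstance(inner, dict) and 'data' in inner:
        return inner
    return '-'


# Hypothetical cells covering each case from the question.
cells = [
    {'value': 2},
    {'value': {'data': {'n': 2, 'm': 'f'}}},
    {'data': {'n': 2, 'm': 'f'}},
    'not a dict',
]
cleaned = [keep_data(c) for c in cells]
print(cleaned)
```

On a DataFrame this would be applied with df['first'] = df['first'].apply(keep_data). Cells read back from a CSV file are strings, so they would need to be parsed into dicts first.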

list of dictionary column in a dataframe

A column in my dataframe is a list of dictionaries, something like this:
How can I filter the rows that have a specific value for the id key in the tag column? For instance, the rows that contain {"id" : 18}.
Since your tag column is list-valued, you could use explode if you are on pandas 0.25+:
# toy data
df = pd.DataFrame({'type':['df','fg','ff'],
'tag': [[{"id" : 12} ,{"id" : 13}],
[{"id" : 12}],
[{'id':10}]]
})
# make each row contain exactly one dict: {id: val}
s = df['tag'].explode()
# the index labels of the matching rows
idx = s.index[pd.DataFrame(s.to_list())['id'].values==12]
# output
df.loc[idx]
Output:
type tag
0 df [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
Sample DataFrame
df=pd.DataFrame({'type':['dg','fg','ff'],'tag':[[{"id" : 12} ,{"id" : 13}] ,[{"id" : 12}],[{"id" : 29}]]})
print(df)
type tag
0 dg [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
2 ff [{'id': 29}]
Then you can use Series.apply to check each cell:
df_filtered=df[df['tag'].apply(lambda x: pd.Series([d['id'] for d in x]).eq(12).any())]
print(df_filtered)
type tag
0 dg [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
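The per-cell check does not need to build a Series for every row. A plain any() over the list is enough, as in this sketch. The name has_tag_id is hypothetical, and the rows are the sample data above written as tuples.

```python
def has_tag_id(tags, wanted):
    """True if any dict in the list `tags` has 'id' equal to `wanted`."""
    # .get() tolerates dicts that have no 'id' key at all.
    return any(d.get('id') == wanted for d in tags)


# (type, tag) pairs mirroring the sample DataFrame.
rows = [
    ('dg', [{'id': 12}, {'id': 13}]),
    ('fg', [{'id': 12}]),
    ('ff', [{'id': 29}]),
]
matching = [t for t, tags in rows if has_tag_id(tags, 12)]
print(matching)
```

With the DataFrame, the equivalent filter would be df[df['tag'].apply(lambda tags: has_tag_id(tags, 12))].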

Convert pandas dataframe to dictionary with nested dictionary based on 2 key columns and 1 value column

I have a dataframe in pandas as follows:
df = pd.DataFrame({'key1': ['abcd', 'defg', 'hijk', 'abcd'],
'key2': ['zxy', 'uvq', 'pqr', 'lkj'],
'value': [1, 2, 4, 5]})
I am trying to create a dictionary with a key of key1 and a nested dictionary of key2 and value. I have tried the following:
dct = df.groupby('key1')[['key2', 'value']].apply(lambda x: x.set_index('key2').to_dict(orient='index')).to_dict()
dct
{'abcd': {'zxy': {'value': 1}, 'lkj': {'value': 5}},
'defg': {'uvq': {'value': 2}},
'hijk': {'pqr': {'value': 4}}}
Desired output:
{'abcd': {'zxy': 1, 'lkj': 5}, 'defg': {'uvq': 2}, 'hijk': {'pqr': 4}}
Using collections.defaultdict, you can construct a defaultdict of dict objects and add elements while iterating your dataframe:
from collections import defaultdict
d = defaultdict(dict)
for row in df.itertuples(index=False):
    d[row.key1][row.key2] = row.value
print(d)
defaultdict(dict,
{'abcd': {'lkj': 5, 'zxy': 1},
'defg': {'uvq': 2},
'hijk': {'pqr': 4}})
As defaultdict is a subclass of dict, this should require no further work.
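One caveat that may matter later: reading a missing key from a defaultdict creates that key. If that is not wanted, dict(d) gives a plain dict. The sketch below uses the example rows written as tuples instead of a DataFrame.

```python
from collections import defaultdict

# Same rows as the example DataFrame, as (key1, key2, value) tuples.
rows = [('abcd', 'zxy', 1), ('defg', 'uvq', 2), ('hijk', 'pqr', 4), ('abcd', 'lkj', 5)]

d = defaultdict(dict)
for key1, key2, value in rows:
    d[key1][key2] = value

# Merely reading a missing key inserts an empty dict under it.
_ = d['missing']
created_by_lookup = 'missing' in d
del d['missing']

# A plain dict raises KeyError on missing keys instead.
plain = dict(d)
print(plain)
```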

Order by value in dictionary

I'm just practising with python. I have a dictionary in the form:
my_dict = [{'word': 'aa', 'value': 2},
{'word': 'aah', 'value': 6},
{'word': 'aahed', 'value': 9}]
How would I go about ordering this dictionary such that if I had thousands of words I would then be able to select the top 100 based on their value ranking? e.g., from just the above example:
scrabble_rank = [{'word': 'aahed', 'rank': 1},
{'word': 'aah', 'rank': 2},
{'word': 'aa', 'rank': 3}]
Firstly, that's not a dictionary; it's a list of dictionaries. Which is good, because a list can be sorted in place (dictionaries only preserve insertion order, and only since Python 3.7).
You can sort the list by the 'value' element by using it as the key to the sort function, with reverse=True to put the highest value first:
my_dict.sort(key=lambda x: x['value'], reverse=True)
Is this what you are looking for:
scrabble_rank = [{'word':it[1], 'rank':idx+1} for idx,it in enumerate(sorted([[item['value'],item['word']] for item in my_dict],reverse=True))]
Using Pandas Library:
import pandas as pd
There is this one-liner:
scrabble_rank = pd.DataFrame(my_dict).sort_values('value', ascending=False).reset_index(drop=True).reset_index().to_dict(orient='records')
It outputs:
[{'index': 0, 'value': 9, 'word': 'aahed'},
{'index': 1, 'value': 6, 'word': 'aah'},
{'index': 2, 'value': 2, 'word': 'aa'}]
Basically, it reads your records into a DataFrame, sorts them by value in descending order, drops the original index, adds the new 0-based index as an 'index' column, and exports the rows as records (your previous format). Add 1 to 'index' to get the rank.
You can use heapq:
import heapq
my_dict = [{'word': 'aa', 'value': 2},
{'word': 'aah', 'value': 6},
{'word': 'aahed', 'value': 9}]
# Select the top 3 records based on `value`
values_sorted = heapq.nlargest(3,                         # fetch top 3
                               my_dict,                   # list of dicts to search
                               key=lambda x: x['value'])  # key definition
print(values_sorted)
[{'word': 'aahed', 'value': 9}, {'word': 'aah', 'value': 6}, {'word': 'aa', 'value': 2}]
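To get from there to the word/rank format asked for in the question, the heapq result can be numbered with enumerate. A minimal sketch, where top_ranked is a hypothetical helper name:

```python
import heapq

my_dict = [{'word': 'aa', 'value': 2},
           {'word': 'aah', 'value': 6},
           {'word': 'aahed', 'value': 9}]


def top_ranked(records, n):
    """Return the n highest-valued records as {'word', 'rank'} dicts."""
    top = heapq.nlargest(n, records, key=lambda r: r['value'])
    # Ranks start at 1 for the highest value.
    return [{'word': r['word'], 'rank': i} for i, r in enumerate(top, start=1)]


scrabble_rank = top_ranked(my_dict, 100)
print(scrabble_rank)
```

Asking for more items than the list holds is fine: nlargest then returns everything, sorted from highest to lowest.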

python efficient group by

I am looking for the most efficient way to extract items from a list of dictionaries. I have a list of about 5k dictionaries. I need to extract those records/items for which grouping by a particular field gives more than a threshold T number of records. For example, if T = 2 and dictionary key 'id':
list = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}, {'name': 'bbc', 'id' : 2}]
The result should be:
list = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}]
i.e. all the records whose id is shared by more than T records (here, at least 3 records with the same id).
l = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}, {'name': 'bbc', 'id' : 2}]
from collections import defaultdict
from itertools import chain
d = defaultdict(list)
T = 2
for dct in l:
    d[dct["id"]].append(dct)
print(list(chain.from_iterable(v for v in d.values() if len(v) > T)))
[{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1}, {'name': 'c', 'id': 1}]
If you want to keep them in groups don't chain just use each value:
[v for v in d.values() if len(v) > T] # itervalues for python2
[[{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1}, {'name': 'c', 'id': 1}]]
Avoid using list as a variable name, as it shadows the built-in list type. With a variable named list, the code above would fail at d = defaultdict(list).
To start out, I would make a dictionary to group the records by id:
control = {}
for d in list:
    control.setdefault(d['id'], []).append(d)
From here, all you have to do is check the length of each list in control to see if it's greater than your specified threshold.
Put it in a function like so:
def find_by_id(obj, threshold):
    control = {}
    for d in obj:
        control.setdefault(d['id'], []).append(d)
    for val in control.values():
        if len(val) > threshold:
            print(val)
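If only the flat, filtered list is needed and the groups themselves are not, counting the ids first is another option. This is a sketch using collections.Counter with the sample data from the question. It makes two passes over the list and keeps the records in their original order.

```python
from collections import Counter

records = [{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1},
           {'name': 'c', 'id': 1}, {'name': 'bbc', 'id': 2}]
T = 2

# First pass: how many records share each id.
counts = Counter(r['id'] for r in records)

# Second pass: keep records whose id occurs more than T times.
kept = [r for r in records if counts[r['id']] > T]
print(kept)
```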
