How to convert a dataframe into a dictionary of sets? - python

I have a dataframe and want to convert it into a dictionary of sets.
To be specific, my dataframe and what I want to make of it are shown below:
  month  date
0   JAN     1
1   JAN     1
2   JAN     1
3   FEB     2
4   FEB     2
5   FEB     3
6   MAR     1
7   MAR     2
8   MAR     3
My goal:
dict = {'JAN' : {1}, 'FEB' : {2,3}, 'MAR' : {1,2,3}}
I also wrote the code below; however, I am not sure it is suitable.
In reality the data is large, so I would like any tips on a more efficient (faster) way to do this.
import pandas as pd

df = pd.DataFrame({'month': ['JAN', 'JAN', 'JAN', 'FEB', 'FEB', 'FEB', 'MAR', 'MAR', 'MAR'],
                   'date': [1, 1, 1, 2, 2, 3, 1, 2, 3]})
df_list = df.values.tolist()
monthSet = {'JAN', 'FEB', 'MAR'}  # a set gives O(1) membership tests
inst_id_dict = {}
for i in df_list:
    monStr = i[0]
    if monStr in monthSet:
        inst_id = i[1]
        inst_id_dict.setdefault(monStr, set()).add(inst_id)

Let's try grouping on the "month" column, then aggregating with GroupBy.unique:
df.groupby('month', sort=False)['date'].unique().to_dict()
# {'JAN': array([1]), 'FEB': array([2, 3]), 'MAR': array([1, 2, 3])}
Or, if you'd prefer a dictionary of sets, use GroupBy.agg:
df.groupby('month', sort=False)['date'].agg(set).to_dict()
# {'JAN': {1}, 'FEB': {2, 3}, 'MAR': {1, 2, 3}}
Another idea is to iteratively build the dict (don't worry: despite the explicit loop, this is likely to outperform the groupby options):
out = {}
for m, d in df.drop_duplicates(['month', 'date']).to_numpy():
    out.setdefault(m, set()).add(d)
out
# {'JAN': {1}, 'FEB': {2, 3}, 'MAR': {1, 2, 3}}
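If you find setdefault awkward, a collections.defaultdict reads a little more cleanly; a minimal sketch, equivalent to the loop above:
from collections import defaultdict

out = defaultdict(set)
for m, d in df.drop_duplicates(['month', 'date']).to_numpy():
    out[m].add(d)
dict(out)
# {'JAN': {1}, 'FEB': {2, 3}, 'MAR': {1, 2, 3}}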

Related

Is there a way to store a dictionary on each row of a dataframe column using a vectorized operation?

I am attempting to nest a dictionary inside of a dataframe.
Here's an example of what I have:
x y z
1 2 3
4 5 6
7 8 9
Here's an example of what I want:
x y z
1 2 {'z':3}
4 5 {'z':6}
7 8 {'z':9}
For this specific application, the whole point of using pandas is its vectorized operations, which are scalable and efficient. Is it possible to transform that column into a column of dictionaries? I have attempted string concatenation, but then the value is stored in pandas as a string rather than a dict, and it later comes back with quotation marks around the dictionary because it is a string.
Example
data = {'x': {0: 1, 1: 4, 2: 7}, 'y': {0: 2, 1: 5, 2: 8}, 'z': {0: 3, 1: 6, 2: 9}}
df = pd.DataFrame(data)
Code
df['z'] = pd.Series(df[['z']].T.to_dict())  # transposing turns each row into a {'z': value} dict keyed by the original index
df
x y z
0 1 2 {'z': 3}
1 4 5 {'z': 6}
2 7 8 {'z': 9}
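If only the one column needs wrapping, a plain list comprehension is a simpler (if less general) alternative; a minimal sketch, starting again from the original frame:
df = pd.DataFrame(data)
df['z'] = [{'z': v} for v in df['z']]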

Fill panel data with ranked timepoints in pandas

Given a DataFrame that represents instances of called customers:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})
The data is ordered by time, such that every customer forms a time series and every customer has different timestamps. Thus I need a column consisting of the ranked time points:
df_2 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
"call_nr" : [0,1,2,0,1,0,1,2,3,0,0,1]})
After trying different approaches I came up with this to create call_nr:
np.concatenate([np.arange(df_1["customer_id"].value_counts().loc[i]) for i in df_1["customer_id"].unique()])
It works, but I doubt this is best practice. Is there a better solution?
A simpler solution would be to group by 'customer_id' and use GroupBy.cumcount:
>>> df_1.groupby('customer_id').cumcount()
0 0
1 1
2 2
3 0
4 1
5 0
6 1
7 2
8 3
9 0
10 0
11 1
which you can assign back as a column in your dataframe.
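For example, materializing the call_nr column from the question:
df_1['call_nr'] = df_1.groupby('customer_id').cumcount()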

extract duplicate values with 3 or more duplicates in a column pandas dataframe

I'm trying to extract a dataframe that only shows rows whose value in a column occurs, e.g., 3 or more times. For example:
df = pd.DataFrame({
    'one': pd.Series(['Berlin', 'Berlin', 'Tokyo', 'Stockholm', 'Berlin', 'Stockholm', 'Amsterdam']),
    'two': pd.Series([1, 2, 3, 4, 5, 6, 7]),
    'three': pd.Series([8, 9, 10, 11, 12])  # shorter than the other columns, so the last two rows get NaN
})
Expected output:
one two three
0 Berlin 1 8
The extraction should only show the row of the first duplicate.
You could do it like this:
rows = df.groupby('one').filter(lambda group: group.shape[0] >= 3).groupby('one').first()
Output:
>>> rows
        two  three
one
Berlin    1    8.0
It works with multiple groups of 3+ duplicates, too. I tested it.
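If you would rather keep the flat row layout of the expected output instead of a grouped index, a boolean mask built with GroupBy.transform is a common alternative (a sketch on the same data):
mask = df.groupby('one')['one'].transform('size') >= 3
df[mask].drop_duplicates(subset='one')
#       one  two  three
# 0  Berlin    1    8.0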

What's an efficient way of aggregating multiple columns with multiple custom functions that use multiple columns in a pandas dataframe?

I have a grouped pandas dataframe. I want to aggregate multiple columns. For each column, there are multiple aggregate functions. This is pretty straightforward. The tricky part is that in each aggregate function, I want to access data in another column.
How would I go about doing this efficiently? Here's the code I already have:
import pandas
data = [
    {'id': 1, 'A': 1, 'B': 1, 'C': 1, 'D': 1, 'E': 1, 'F': 1},
    {'id': 1, 'A': 2, 'B': 2, 'C': 2, 'D': 2, 'E': 2, 'F': 2},
    {'id': 2, 'A': 3, 'B': 3, 'C': 3, 'D': 3, 'E': 3, 'F': 3},
    {'id': 2, 'A': 4, 'B': 4, 'C': 4, 'D': 4, 'E': 4, 'F': 4},
]
df = pandas.DataFrame.from_records(data)
def get_column(column, column_name):
    return df.iloc[column.index][column_name]

def agg_sum_a_b(column_a):
    return column_a.sum() + get_column(column_a, 'B').sum()

def agg_sum_a_b_divide_c(column_a):
    return (column_a.sum() + get_column(column_a, 'B').sum()) / get_column(column_a, 'C').sum()

def agg_sum_d_divide_sum_e_f(column_d):
    return column_d.sum() / (get_column(column_d, 'E').sum() + get_column(column_d, 'F').sum())

def multiply_then_sum(column_e):
    return (column_e * get_column(column_e, 'F')).sum()

df_grouped = df.groupby('id')
df_agg = df_grouped.agg({
    'A': [agg_sum_a_b, agg_sum_a_b_divide_c, 'sum'],
    'D': [agg_sum_d_divide_sum_e_f, 'sum'],
    'E': [multiply_then_sum]
})
This code produces this dataframe:
A D E
agg_sum_a_b agg_sum_a_b_divide_c sum agg_sum_d_divide_sum_e_f sum multiply_then_sum
id
1 6 2 3 0.5 3 5
2 14 2 7 0.5 7 25
Am I doing this correctly? Is there a better way of doing this? I find the way I access data in another column within the aggregate function a little awkward.
The real data and code I'm using has about 20 columns and around 40 aggregate functions. There could potentially be hundreds of groups as well with each group having hundreds of rows.
When I do this using the real data and aggregate functions, it can take several minutes, which is too slow for my purposes. Any way to make this more efficient?
Edit: I'm using Python 3.6 and pandas 0.23.0 btw. Thanks!
Edit 2: Added an example where I don't call sum() on the columns.
First, I think you need apply rather than agg to access different columns at once. Here is an idea of how to change what you want to do a bit. Let's first create a function that groups the operations you want to perform and returns them as a list of results:
# assumes pandas was imported as pd (import pandas as pd)
def operations_to_perform(df_g):
    df_g_sum = df_g.sum()  # can do the same with mean, min, max ...
    # return all the operations you want
    return [df_g_sum['A'] + df_g_sum['B'],
            (df_g_sum['A'] + df_g_sum['B']) / df_g_sum['C'],
            df_g_sum['A'],
            float(df_g_sum['D']) / (df_g_sum['E'] + df_g_sum['F']),
            (df_g['E'] * df_g['F']).sum()]

# use apply to create a Series with id as index and a list of aggregates per group
df_values = df.groupby('id').apply(operations_to_perform)

# now create the result dataframe from df_values with tolist() and the index
df_agg = pd.DataFrame(df_values.tolist(), index=df_values.index,
                      columns=pd.MultiIndex.from_arrays([['A'] * 3 + ['D'] + ['E'],
                                                         ['agg_sum_a_b', 'agg_sum_a_b_div_c', 'sum',
                                                          'agg_sum_d_div_sum_e_f', 'e_mult_f']]))
and df_agg looks like:
A D E
agg_sum_a_b agg_sum_a_b_div_c sum agg_sum_d_div_sum_e_f e_mult_f
id
1 6 2 3 0.5 5
2 14 2 7 0.5 25
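Since every aggregate here reduces to sums over (possibly precomputed) columns, a further speed-up worth trying (a sketch, not part of the original answer; it produces flat rather than MultiIndex columns) is to precompute the row-level product once, run a single vectorized groupby sum, and derive every statistic from the result:
g = df.assign(EF=df['E'] * df['F']).groupby('id').sum()  # one vectorized pass over the data
df_agg2 = pd.DataFrame({
    'agg_sum_a_b': g['A'] + g['B'],
    'agg_sum_a_b_div_c': (g['A'] + g['B']) / g['C'],
    'sum_a': g['A'],
    'agg_sum_d_div_sum_e_f': g['D'] / (g['E'] + g['F']),
    'e_mult_f': g['EF'],
})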

How to multiply to each value in each element in SArray in Python?

I'm using Graphlab, but I guess this question can apply to pandas.
import graphlab
sf = graphlab.SFrame({'id': [1, 2, 3],
                      'user_score': [{"a": 4, "b": 3}, {"a": 5, "b": 7}, {"a": 2, "b": 3}],
                      'weight': [4, 5, 2]})
I want to create a new column where the value of each element in 'user_score' is multiplied by the number in 'weight'. That is,
sf = graphlab.SFrame({'id': [1, 2, 3],
                      'user_score': [{"a": 4, "b": 3}, {"a": 5, "b": 7}, {"a": 2, "b": 3}],
                      'weight': [4, 5, 2],
                      'new': [{"a": 16, "b": 12}, {"a": 25, "b": 35}, {"a": 4, "b": 6}]})
I tried to write the simple function below and apply it, to no avail. Any thoughts?
def trans(x, y):
    d = dict()
    for k, v in x.items():
        d[k] = v * y
    return d

sf.apply(trans(sf['user_score'], sf['weight']))
This produced the following error message:
AttributeError: 'SArray' object has no attribute 'items'
I'm using pandas dataframe, but it should also work in your case.
import pandas as pd
df['new'] = [dict((k, v * y) for k, v in x.items()) for x, y in zip(df['user_score'], df['weight'])]
Input dataframe:
df
Out[34]:
id user_score weight
0 1 {u'a': 4, u'b': 3} 4
1 2 {u'a': 5, u'b': 7} 5
2 3 {u'a': 2, u'b': 3} 2
Output:
df
Out[36]:
id user_score weight new
0 1 {u'a': 4, u'b': 3} 4 {u'a': 16, u'b': 12}
1 2 {u'a': 5, u'b': 7} 5 {u'a': 25, u'b': 35}
2 3 {u'a': 2, u'b': 3} 2 {u'a': 4, u'b': 6}
This is subtle, but I think what you want is this:
sf.apply(lambda row: trans(row['user_score'], row['weight']))
The apply function takes a function as its argument, and will pass each row as the parameter to that function. In your version, you are evaluating the trans function before apply is called, which is why the error message complains about passing an SArray to the trans function when a dict is expected.
Here is one of many possible solutions:
In [69]: df
Out[69]:
id user_score weight
0 1 {'b': 3, 'a': 4} 4
1 2 {'b': 7, 'a': 5} 5
2 3 {'b': 3, 'a': 2} 2
In [70]: df['user_score'] = df['user_score'].apply(pd.Series).mul(df.weight, axis=0).to_dict('records')
In [71]: df
Out[71]:
id user_score weight
0 1 {'b': 12, 'a': 16} 4
1 2 {'b': 35, 'a': 25} 5
2 3 {'b': 6, 'a': 4} 2
