Cut Pandas dataframe based on unique values per column

I want to split a pandas DataFrame that has duplicated values in a column into separate DataFrames, one for each combination of the duplicated rows.
So from this:
df = pd.DataFrame({'block': ['A', 'B', 'B', 'C'],
'd': [{'A': 1}, {'B': 'A'}, {'B': 3}, {'C': 'Z'}]})
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 B {'B': 3 }
3 C {'C': 'Z'}
I would like to achieve two separate data frames:
df1:
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 C {'C': 'Z'}
df2:
block d
0 A {'A': 1 }
1 B {'B': 3 }
2 C {'C': 'Z'}
Another example:
df = pd.DataFrame({'block': ['A', 'B', 'B', 'C', 'C'],
'd': [{'A': 1}, {'B': 'A'}, {'B': 3}, {'C': 'Z'}, {'C': 10}]})
Result:
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 C {'C': 'Z'}
block d
0 A {'A': 1 }
1 B {'B': 'A'}
2 C {'C': 10 }
block d
0 A {'A': 1 }
1 B {'B': 3 }
2 C {'C': 'Z'}
block d
0 A {'A': 1 }
1 B {'B': 3 }
2 C {'C': 10 }
I should add that I want to preserve the order of the column 'block'.
I tried pandas explode and the itertools package, but without good results. If someone knows how to solve this, please help.

One way using pandas.DataFrame.groupby, iterrows and itertools.product:
from itertools import product
prods = []
# sort=False keeps the blocks in their original order
for _, d in df.groupby("block", sort=False):
    prods.append([s for _, s in d.iterrows()])
dfs = [pd.concat(ss, axis=1).T for ss in product(*prods)]
print(dfs)
Output:
[ block d
0 A {'A': 1}
1 B {'B': 'A'}
3 C {'C': 'Z'},
block d
0 A {'A': 1}
2 B {'B': 3}
3 C {'C': 'Z'}]
Output for second sample df:
[ block d
0 A {'A': 1}
1 B {'B': 'A'}
3 C {'C': 'Z'},
block d
0 A {'A': 1}
1 B {'B': 'A'}
4 C {'C': 10},
block d
0 A {'A': 1}
2 B {'B': 3}
3 C {'C': 'Z'},
block d
0 A {'A': 1}
2 B {'B': 3}
4 C {'C': 10}]
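The output above keeps the original row labels (0, 1, 3 and so on), while the question shows each result numbered 0, 1, 2. A minimal sketch of one way to get that, assuming it is acceptable to select rows by label with .loc instead of concatenating Series; the names groups and combo are illustrative only:

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({
    'block': ['A', 'B', 'B', 'C'],
    'd': [{'A': 1}, {'B': 'A'}, {'B': 3}, {'C': 'Z'}],
})

# sort=False keeps the groups in order of first appearance in 'block'.
groups = [list(g.index) for _, g in df.groupby('block', sort=False)]

# Each combination picks one row label per block; .loc selects those rows,
# and reset_index(drop=True) renumbers them from 0.
dfs = [df.loc[list(combo)].reset_index(drop=True) for combo in product(*groups)]

for part in dfs:
    print(part, end='\n\n')
```

Selecting by label also keeps the original column dtypes, whereas transposing concatenated Series tends to produce object columns.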

Related

Convert multiple columns into dict after groupby in pandas

I have a dataframe
df = pd.DataFrame({"a":[1,1,1,2,2,2,3,3], "b":["a","a","a","b","b","b","c","c"], "c":[0,0,1,0,1,1,0,1], "d":["x","y","z","x","y","y","z","x"]})
a b c d
0 1 a 0 x
1 1 a 0 y
2 1 a 1 z
3 2 b 0 x
4 2 b 1 y
5 2 b 1 y
6 3 c 0 z
7 3 c 1 x
I want to groupby on column a and column b to get following output:
a b e
0 1 a [{'c': 0, 'd': 'x'}, {'c': 0, 'd': 'y'}, {'c': 1, 'd': 'z'}]
1 2 b [{'c': 0, 'd': 'x'}, {'c': 1, 'd': 'y'}, {'c': 1, 'd': 'y'}]
2 3 c [{'c': 0, 'd': 'z'}, {'c': 1, 'd': 'x'}]
My solution:
new_df = df.groupby(["a","b"])["c","d"].apply(lambda x: x.to_dict(orient="records")).reset_index(name="e")
But it behaves inconsistently; sometimes I get the error below:
reset_index() got an unexpected keyword argument "name"
It would be helpful if someone could point out the issue in the above solution or suggest an alternative way of doing this.

You can do
new=df.groupby(['a','b'])[['c','d']].apply(lambda x : x.to_dict('records')).to_frame('e').reset_index()
Out[13]:
a b e
0 1 a [{'c': 0, 'd': 'x'}, {'c': 0, 'd': 'y'}, {'c':...
1 2 b [{'c': 0, 'd': 'x'}, {'c': 1, 'd': 'y'}, {'c':...
2 3 c [{'c': 0, 'd': 'z'}, {'c': 1, 'd': 'x'}]

Alternatively we can do:
df['e'] = df[['c', 'd']].agg(lambda s: dict(zip(s.index, s.values)), axis=1)
df1 = df.groupby(['a', 'b'])['e'].agg(list).reset_index()
# print(df1)
a b e
0 1 a [{'c': 0, 'd': 'x'}, {'c': 0, 'd': 'y'}, {'c':...
1 2 b [{'c': 0, 'd': 'x'}, {'c': 1, 'd': 'y'}, {'c':...
2 3 c [{'c': 0, 'd': 'z'}, {'c': 1, 'd': 'x'}]
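One plausible cause of the error in the question, offered as an assumption rather than a diagnosis: Series.reset_index accepts a name argument, but DataFrame.reset_index does not, so the call fails whenever the groupby/apply step hands back a DataFrame instead of a Series. The toy data below is made up purely to show the difference:

```python
import pandas as pd

# A Series of lists, similar in shape to a groupby/apply result.
s = pd.Series([[1, 2], [3]], index=pd.Index(['x', 'y'], name='key'))

# Series.reset_index accepts name= and uses it for the values column.
as_frame = s.reset_index(name='e')
print(as_frame)

# DataFrame.reset_index has no name= parameter, so this raises TypeError.
frame = s.to_frame('e')
raised = False
try:
    frame.reset_index(name='e')
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```

Calling to_frame('e') before reset_index(), as in the first answer, sidesteps the problem when the intermediate result is a Series.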

How to convert a co-occurrence matrix to networkx graph

I am using the following code to convert my list of lists to a co-occurrence matrix.
lst = [
['a', 'b'],
['b', 'c', 'd', 'e'],
['a', 'd'],
['b', 'e']
]
u = (pd.get_dummies(pd.DataFrame(lst), prefix='', prefix_sep='')
.groupby(level=0, axis=1)
.sum())
v = u.T.dot(u)
v.values[(np.r_[:len(v)], ) * 2] = 0
print(v)
My output is as follows:
a b c d e
a 0 1 0 1 0
b 1 0 1 1 2
c 0 1 0 1 1
d 1 1 1 0 1
e 0 2 1 1 0
I would like to transform my co-occurrence matrix to a weighted undirected networkx graph where weights represent the co-occurrence count in the matrix.
Currently, I have tried to do it as follows. However, I am not sure how I can insert weights to the graph.
print("get x and y pairs")
#get (x,y) pairs from the cooccurrence matrix
arr = np.where(v>=1)
corrs = [(v.index[x], v.columns[y]) for x, y in zip(*arr)]
#get the unique pairs
final_arr = []
for x, y in corrs:
    if (y,x) not in final_arr:
        final_arr.append((x,y))
#construct the graph
G = nx.Graph()
nodes_vocabulary_list = ['a', 'b', 'c', 'd', 'e']
G.add_nodes_from(nodes_vocabulary_list)
G.add_edges_from(final_arr)
I am wondering if there is an easier way to do this in networkx.
I am happy to provide more details if needed.

I believe you can use:
lst = [
['a', 'b'],
['b', 'c', 'd', 'e'],
['a', 'd'],
['b', 'e']
]
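# sum(level=..., axis=1) was removed in pandas 2.0; group the duplicate column labels via a transpose instead
u = pd.get_dummies(pd.DataFrame(lst), prefix='', prefix_sep='', dtype=int).T.groupby(level=0).sum().T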
v = u.T.dot(u)
#set 0 to lower triangular matrix
v.values[np.tril(np.ones(v.shape)).astype(bool)] = 0
print(v)
a b c d e
a 0 1 0 1 0
b 0 0 1 1 2
c 0 0 0 1 1
d 0 0 0 0 1
e 0 0 0 0 0
#reshape and filter only count > 0
a = v.stack()
a = a[a >= 1].rename_axis(('source', 'target')).reset_index(name='weight')
print(a)
source target weight
0 a b 1
1 a d 1
2 b c 1
3 b d 1
4 b e 2
5 c d 1
6 c e 1
7 d e 1
Create the graph with from_pandas_edgelist:
import networkx as nx
G = nx.from_pandas_edgelist(a, edge_attr=True)
print (nx.to_dict_of_dicts(G))
{'a': {'b': {'weight': 1}, 'd': {'weight': 1}},
'b': {'a': {'weight': 1}, 'c': {'weight': 1}, 'd': {'weight': 1}, 'e': {'weight': 2}},
'd': {'a': {'weight': 1}, 'b': {'weight': 1}, 'c': {'weight': 1}, 'e': {'weight': 1}},
'c': {'b': {'weight': 1}, 'd': {'weight': 1}, 'e': {'weight': 1}},
'e': {'b': {'weight': 2}, 'c': {'weight': 1}, 'd': {'weight': 1}}}
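As a possible shortcut, the symmetric co-occurrence matrix can be passed to networkx directly. The sketch below assumes the matrix is a square DataFrame with identical row and column labels and a zero diagonal; the values are copied from the matrix printed in the question rather than recomputed:

```python
import networkx as nx
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']
# Co-occurrence counts copied from the question's printed matrix.
v = pd.DataFrame(
    [[0, 1, 0, 1, 0],
     [1, 0, 1, 1, 2],
     [0, 1, 0, 1, 1],
     [1, 1, 1, 0, 1],
     [0, 2, 1, 1, 0]],
    index=labels, columns=labels,
)

# Non-zero cells become edges; the cell value is stored as the 'weight' attribute.
G = nx.from_pandas_adjacency(v)

print(G.number_of_edges())
print(G['b']['e']['weight'])
```

Because the default graph type is undirected, each symmetric pair of cells collapses into a single edge, so there is no need to zero out the lower triangle first.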

Splitting dictionary into existing columns

Suppose I have the dataframe pd.DataFrame([{'a':nan, 'b':nan, 'c':{'a':1, 'b':2}}, {'a':4, 'b':7, 'c':nan}, {'a':nan, 'b':nan, 'c':{'a':6, 'b':7}}]). I want to take the values from the keys in the dictionary in column c and write them into columns a and b.
Expected output is:
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
I know how to do this to create new columns, but that is not the task I need for this, since a and b have relevant information needing updates from c. I have not been able to find anything relevant to this task.
Any suggestions for an efficient method would be most welcome.
** EDIT **
The real problem is that I have the following dataframe, which I reduced to the above (in several, no doubt, extraneous steps):
a b c
0 nan nan [{'a':1, 'b':2}, {'a':6, 'b':7}]
1 4 7 nan
and I need to have output, in as few steps as possible, as per
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
Thanks!

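This works, assuming column c holds dicts (or their string representation) and NaN elsewhere:
import ast
def func(x):
    d = x['c']
    if isinstance(d, str):
        d = ast.literal_eval(d)  # parse a stringified dict safely
    if isinstance(d, dict):  # rows where c is NaN are left untouched
        x['a'] = d['a']
        x['b'] = d['b']
    return x
df = df.apply(func, axis=1)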

How about this:
for t in d['c'].keys():
    d[t] = d['c'][t]
Here is an example:
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> d
{'a': '', 'b': '', 'c': {'a': 1, 'b': 2}}
>>> d.keys()
dict_keys(['a', 'b', 'c'])
>>> d['c'].keys()
dict_keys(['a', 'b'])
>>> for t in d['c'].keys():
...     d[t] = d['c'][t]
...
>>> d
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>>
We can turn it into a function:
>>> def updateDict(mapping, sourceKey):
...     for targetKey in mapping[sourceKey].keys():
...         mapping[targetKey] = mapping[sourceKey][targetKey]
...     return mapping
...
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2, 'z':1000}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2, 'z': 1000}, 'z': 1000}
>>>
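For the DataFrame in the question's EDIT, a pandas-only sketch might look like the following. It assumes column c holds either a list of dicts or NaN, and that existing values in a and b should be kept wherever they are already set. Note that the exploded rows stay next to each other, so the row order differs from the expected output shown in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, 4],
    'b': [np.nan, 7],
    'c': [[{'a': 1, 'b': 2}, {'a': 6, 'b': 7}], np.nan],
})

# One row per dict in 'c'; rows where 'c' is NaN pass through unchanged.
out = df.explode('c').reset_index(drop=True)

# Expand the dicts into their own columns, aligned on the row index.
is_dict = out['c'].map(lambda value: isinstance(value, dict))
expanded = pd.DataFrame(out.loc[is_dict, 'c'].tolist(), index=out.index[is_dict])

# Fill only the missing entries of 'a' and 'b' from the expanded dicts.
for col in ['a', 'b']:
    out[col] = out[col].fillna(expanded[col])

print(out)
```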

Python For each group in DataFrame create a list of dictionaries

I am trying to create a list of dictionaries for each group defined by column 'id'. Below is the pandas DataFrame.
id | day | b | c
-----------------
A1 1 H 2
A1 1 C 1
A1 2 H 3
A1 2 C 5
A2 1 H 5
A2 1 C 6
A2 2 H 2
A2 2 C 1
What I am trying to accomplish is a list of dictionaries for each 'id':
id A1: [{H: 2, C: 1}, {H: 3, C: 5}]
id A2: [{H: 5, C: 6}, {H: 2, C: 1}]

A little bit long ..:-)
df.groupby(['id','day'])[['b','c']].apply(lambda x : {t[0]:t[1:][0] for t in x.values.tolist()}).groupby(level=0).apply(list)
Out[815]:
id
A1 [{'H': 2, 'C': 1}, {'H': 3, 'C': 5}]
A2 [{'H': 5, 'C': 6}, {'H': 2, 'C': 1}]
dtype: object

Let's reshape the dataframe then use groupby and to_dict:
df.set_index(['id','day','b']).unstack()['c']\
.groupby(level=0).apply(lambda x: x.to_dict('records'))
Output:
id
A1 [{'H': 2, 'C': 1}, {'H': 3, 'C': 5}]
A2 [{'H': 5, 'C': 6}, {'H': 2, 'C': 1}]
dtype: object

We can use two groupby passes, i.e.:
one = df.groupby(['id','day']).apply(lambda x : dict(zip(x['b'],x['c']))).reset_index()
id day 0
0 A1 1 {'C': 1, 'H': 2}
1 A1 2 {'C': 5, 'H': 3}
2 A2 1 {'C': 6, 'H': 5}
3 A2 2 {'C': 1, 'H': 2}
one.groupby('id')[0].apply(list)
id
A1 [{'C': 1, 'H': 2}, {'C': 5, 'H': 3}]
A2 [{'C': 6, 'H': 5}, {'C': 1, 'H': 2}]
Name: 0, dtype: object
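If a plain Python dict keyed by id is acceptable instead of a Series, the same grouping can be sketched with nested comprehensions. This is only an illustration under that assumption; the name result is made up, and the DataFrame is rebuilt from the table in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'id':  ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
    'day': [1, 1, 2, 2, 1, 1, 2, 2],
    'b':   ['H', 'C', 'H', 'C', 'H', 'C', 'H', 'C'],
    'c':   [2, 1, 3, 5, 5, 6, 2, 1],
})

# Outer loop: one entry per id. Inner loop: one dict per day,
# mapping each value of 'b' to its 'c' (converted to plain Python ints).
result = {
    key: [dict(zip(day_rows['b'], day_rows['c'].tolist()))
          for _, day_rows in id_rows.groupby('day')]
    for key, id_rows in df.groupby('id')
}

print(result)
```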

Python: Read tree like data from text file and store in dictionary

I have a text file containing tree like data separated by a space with following structure:
'Number of branches' 'Name of node' ... 0 (0 means no further branch)
For example:
1 A 3 B 2 C 0
D 0
E 1 F 0
G 2 H 0
I 0
the corresponding dictionary 'tree' should be like:
tree = {'A': {'B': {'C': {},
'D': {}},
'E': {'F': {}},
'G': {'H': {},
'I': {}}}}
I think a recursive approach is the right way to do it, but I am unable to make it work. I have the following function so far:
def constructNodes(branch):
    global nodes_temp
    if not branch:
        branch = deque(file.readline().strip().split())
    node1 = branch.popleft()
    nodes_temp.add(node1)
    nbBranches = int(branch.popleft())
    for i in xrange(nbBranches):
        constructNodes(branch)
    return nodes_temp
Thanks in advance for the help.

You don't need a deque to walk through a sequence; a regular Python iterator is enough, and it makes things a lot more concise.
data = """
1 A 3 B 2 C 0
D 0
E 1 F 0
G 2 H 0
I 0"""
def construct_nodes(data):
    # read the branch count, then build one (name, subtree) pair per branch
    return dict((next(data), construct_nodes(data))
                for _ in range(int(next(data))))
print(construct_nodes(iter(data.split())))

I think this is what you want:
from collections import deque
data = deque('''\
1 A 3 B 2 C 0
D 0
E 1 F 0
G 2 H 0
I 0'''.split())
def constructNodes():
    D = {}
    count = int(data.popleft())
    for _ in range(count):
        node = data.popleft()
        D[node] = constructNodes()
    return D
tree = constructNodes()
print(tree)
Output:
{'A': {'B': {'C': {}, 'D': {}}, 'G': {'H': {}, 'I': {}}, 'E': {'F': {}}}}
With some formatting:
{'A': {'B': {'C': {},
'D': {}},
'G': {'H': {},
'I': {}},
'E': {'F': {}}}}
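Since the question reads the data from a text file, here is a sketch of the same recursion driven by a file-like object. io.StringIO stands in for the opened file, and the helper names tokens and build_tree are hypothetical:

```python
import io

def tokens(handle):
    """Yield whitespace-separated tokens from a file-like object, line by line."""
    for line in handle:
        yield from line.split()

def build_tree(tok):
    """Read a branch count, then that many (name, subtree) pairs."""
    tree = {}
    for _ in range(int(next(tok))):
        name = next(tok)
        tree[name] = build_tree(tok)
    return tree

# io.StringIO simulates the text file; with a real file, pass the open handle.
sample = io.StringIO("1 A 3 B 2 C 0\nD 0\nE 1 F 0\nG 2 H 0\nI 0\n")
tree = build_tree(tokens(sample))
print(tree)
```

Line breaks and indentation in the file carry no meaning here; only the token order matters, so the parser never has to hold more than one line in memory.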
