I am trying to apply a function, cumulatively, to values that lie within a window defined by 'start' and 'finish' columns. So, 'start' and 'finish' define the intervals where the value is 'active'; for each row, I want to get a sum of all 'active' values at the time.
Here is a 'bruteforce' example that does what I am after - is there a more elegant, faster or more memory efficient way of doing this?
import pandas as pd

df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
df['dummy'] = 1                                # constant key for a full cross join
df = df.merge(df, on=['dummy'], how='left')    # pair every row with every row
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]  # keep pairs active at start_x
val = df.groupby('start_x')['val_y'].sum()     # sum of active values at each start
Originally, df is:
start finish val
0 1 3 100
1 2 4 200
2 3 6 300
3 4 6 400
4 5 6 500
The result I am after is:
1 100
2 300
3 500
4 700
5 1200
numba
import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    for s, f, v in zip(S, F, V):
        out[s:f] += v          # add this row's value to every slot it covers
    return out[mn:]
pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
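For the example frame this should return array([ 100.,  300.,  500.,  700., 1200.]) (floats, since the accumulator is created with np.zeros).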
np.bincount
s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))
array([ 100., 300., 500., 700., 1200.])
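To unpack the one-liner: each row contributes its value once for every integer slot in [start, finish), and np.bincount sums the contributions per slot. The same computation in separate steps (a sketch; the names idx and wts are mine):

idx = np.concatenate([np.arange(b, e) for b, e in zip(s, f)]) - 1  # 0-based slot for each contribution
wts = v.repeat(f - s)                                              # each value, repeated once per covered slot
out = np.bincount(idx, weights=wts)                                # out[k] = total active value at start k + 1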
Comprehension
This depends on the index being unique
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)
1 100
2 300
3 500
4 700
5 1200
dtype: int64
With no dependence on index
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
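Note that Series.sum(level=0) was deprecated in pandas 1.3 and later removed; on recent versions the equivalent is a groupby on the index level:

pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).groupby(level=0).sum()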
Using numpy broadcasting. Unfortunately it is still an O(n*m) solution, but it should be faster than the groupby. So far, based on my tests, Pir's solution performs best.
s1 = df['start'].values
s2 = df['finish'].values
np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)
Some timing
#df=pd.concat([df]*1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    s1 = df['start'].values
    s2 = df['finish'].values
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
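PIR in the timings refers to the numba solution above; the original timing code does not show its wrapper, but presumably it is something along these lines (my reconstruction):

def PIR(df):
    # assumed wrapper: unpack the columns and call the @njit-compiled function
    return pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])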
Related
I have a very large dataset containing the members in each team in each month. I want to find additions and deletions to each team. Because my dataset is very big, I'm trying to use in-built functions as much as possible.
My dataset looks like this:
month team members
0 0 A X, Y, Z
1 1 A X, Y
2 2 A W, X, Y
3 0 B D, E
4 1 B D, E, F
5 2 B F
It's generated by the following code:
num_months = 3
num_teams = 2
obs = num_months*num_teams
df = pd.DataFrame({"month": [i % num_months for i in range(obs)],
"team": ['AB'[i // num_months] for i in range(obs)],
"members": ["X, Y, Z", "X, Y", "W, X, Y", "D, E", "D, E, F", "F"]})
df
The result should be like this:
month team members additions deletions
0 0 A X, Y, Z None None
1 1 A X, Y None Z
2 2 A W, X, Y W None
3 0 B D, E None None
4 1 B D, E, F F None
5 2 B F None D, E
or in Python code
df = pd.DataFrame({"month": [i % num_months for i in range(obs)],
"team": ['AB'[i // num_months] for i in range(obs)],
"members": ["X, Y, Z", "X, Y", "W, X, Y", "D, E", "D, E, F", "F"],
"additions": [None, None, "W", None, "F", None],
"deletions": [None, "Z", None, None, None, "D, E"]
})
A technique that immediately comes to mind is to create a new column which shows the lagged value of members in each group, followed by taking the set difference (both ways) between both columns.
Is there a way to take set differences between columns using pandas inbuilt functions?
Are there other techniques I should try?
Using set, groupby, apply, and shift.
For efficiency:
Convert members to set type, because - is an unsupported operand between strings and would raise a TypeError.
Leave additions and deletions as set type
Using apply
With a dataframe of 60000 rows:
91.4 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# clean the members column
df.members = df.members.str.replace(' ', '').str.split(',').map(set)
# create del and add
df['deletions'] = df.groupby('team')['members'].apply(lambda x: x.shift() - x)
df['additions'] = df.groupby('team')['members'].apply(lambda x: x - x.shift())
# result
month team members additions deletions
0 A {Z, X, Y} NaN NaN
1 A {X, Y} {} {Z}
2 A {W, X, Y} {W} {}
0 B {D, E} NaN NaN
1 B {D, F, E} {F} {}
2 B {F} {} {D, E}
More Efficiently
pandas.DataFrame.diff
With a dataframe of 60000 rows:
60.7 ms ± 3.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['deletions'] = df.groupby('team')['members'].diff(periods=-1).shift()
df['additions'] = df.groupby('team')['members'].diff()
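This works because members now holds set objects, and diff on an object column falls back to elementwise subtraction, which for sets is set difference. A minimal check of that behaviour (a sketch):

s = pd.Series([{'X', 'Y'}, {'X', 'Z'}])
print(s.diff())                     # [NaN, {'Z'}]  -> gained members (additions)
print(s.diff(periods=-1).shift())   # [NaN, {'Y'}]  -> lost members (deletions)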
Here is one way to do it. Not sure if this is the most efficient; I've found that it is not that straightforward to optimize pandas performance just by looking at the code.
The strategy I've adopted is to calculate the deletions and additions separately and then somehow merge that information back into the original DataFrame.
This solution assumes that the input DataFrame is sorted by (team, month). If not, you'd need to do that first.
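For instance, the sort could be done up front like this:

# ensure rows within each team appear in month order before grouping
df = df.sort_values(['team', 'month']).reset_index(drop=True)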
def set_diff_adds(x):
    retval = {}
    for m, b, a in zip(x.month.iloc[1:], x.members.iloc[1:], x.members):
        retval[m] = (set(b.replace(' ', '').split(',')) -
                     set(a.replace(' ', '').split(',')))
    return retval

def set_diff_dels(x):
    retval = {}
    for m, b, a in zip(x.month.iloc[1:], x.members.iloc[1:], x.members):
        retval[m] = (set(a.replace(' ', '').split(',')) -
                     set(b.replace(' ', '').split(',')))
    return retval
deletions = df.groupby('team').apply(set_diff_dels).apply(pd.Series)
deletions.columns.set_names('month', inplace=True)
deletions = deletions.stack().to_frame('deletions').reset_index()
merged = df.merge(deletions, how='outer')
additions = df.groupby('team').apply(set_diff_adds).apply(pd.Series)
additions.columns.set_names('month', inplace=True)
additions = additions.stack().to_frame('additions').reset_index()
merged = merged.merge(additions, how='outer')
merged
month team members deletions additions
0 0 A X, Y, Z NaN NaN
1 1 A X, Y {Z} {}
2 2 A W, X, Y {} {W}
3 0 B D, E NaN NaN
4 1 B D, E, F {} {F}
5 2 B F {D, E} {}
I have a pandas dataframe which looks like this:
colour points
0 red 1
1 yellow 10
2 black -3
Then I'm trying to do the following algorithm:
combos = []
points = []

for i1 in range(len(df)):
    for i2 in range(len(df)):
        colour_main = df['colour'].values[i1]
        colour_secondary = df['colour'].values[i2]
        combo = colour_main + "_" + colour_secondary
        point1 = df['points'].values[i1]
        point2 = df['points'].values[i2]
        new_points = point1 + point2
        combos.append(combo)
        points.append(new_points)

df_new = pd.DataFrame({'colours': combos,
                       'points': points})
print(df_new)
I want to get all combinations and sum points:
if the colour is used as the main colour, I want to add its value
if the colour is used as the secondary colour, I want to add the opposite (negated) value
Example:
red_yellow = 1 + (-10) = -9
red_black = 1 + ( +3) = 4
black_red = -3 + ( -1) = -4
The output I currently get:
colours points
0 red_red 2
1 red_yellow 11
2 red_black -2
3 yellow_red 11
4 yellow_yellow 20
5 yellow_black 7
6 black_red -2
7 black_yellow 7
8 black_black -6
The output I'm looking for:
red_yellow -9
red_black 4
yellow_red 9
yellow_black 13
black_red -4
black_yellow -13
I don't know how to apply my logic to this code. Also, I bet there is a simpler way to get all combinations without the two nested loops, but currently that's the only thing that comes to mind.
I would like to:
get the desired output
improve the performance in cases when we get like 20 input colours
remove duplicates like red_red
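For reference, the nested loops plus the sign logic and duplicate removal can be written directly with itertools.permutations, which yields every ordered pair of distinct rows (a minimal sketch of the corrected logic, not tuned for speed; the answers below cover faster options):

import itertools
import pandas as pd

# main minus secondary, for every ordered pair of distinct colours
pairs = itertools.permutations(df.itertuples(index=False), 2)
df_new = pd.DataFrame(
    [(a.colour + '_' + b.colour, a.points - b.points) for a, b in pairs],
    columns=['colours', 'points'])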
Here is a timeit comparison of a few alternatives.
| method | ms per loop |
|--------------------+-------------|
| alt2 | 2.36 |
| using_concat | 3.26 |
| using_double_merge | 22.4 |
| orig | 22.6 |
| alt | 45.8 |
The timeit results were generated using IPython:
In [138]: df = make_df(20)
In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop
import itertools
import numpy as np
import pandas as pd
def alt(df):
    df['const'] = 1
    result = pd.merge(df, df, on='const', how='outer')
    result = result.loc[(result['colour_x'] != result['colour_y'])]
    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result

def alt2(df):
    points = np.add.outer(df['points'], -df['points'])
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    mask = color.labels[0] != color.labels[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points': points.ravel(), 'color': color})
    result = result.loc[mask]
    return result
def orig(df):
    combos = []
    points = []
    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary
                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2
                combos.append(combo)
                points.append(new_points)
    return pd.DataFrame({'color': combos, 'points': points})
def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})
    return df2
def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
             .reset_index()
             .drop(['colour', 'points'], 1)
             .merge(df.set_index('colour'), left_on='level_0', right_index=True)
             .merge(df.set_index('colour'), left_on='level_1', right_index=True))
    new['points_y'] *= -1
    new['sum'] = new.sum(axis=1)
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), 1)
    return new[['colours', 'sum']]
def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N),
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df
The main idea in alt2 is to use np.add.outer to construct an addition table out of df['points']:
In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]:
array([[ 0, -9, 4],
[ 9, 0, 13],
[ -4, -13, 0]])
ravel is used to make the array 1-dimensional:
In [152]: points.ravel()
Out[152]: array([ 0, -9, 4, 9, 0, 13, -4, -13, 0])
and the color combinations are generated with pd.MultiIndex.from_product:
In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]:
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
'yellow_black', 'black_red', 'black_yellow', 'black_black'],
dtype='object')
A mask is generated to remove duplicates:
mask = color.labels[0] != color.labels[1]
and then the result is generated from these parts:
result = pd.DataFrame({'points':points.ravel(), 'color':color})
result = result.loc[mask]
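One caveat if you run alt2 on a recent pandas: MultiIndex.labels was renamed to codes in pandas 0.24, so the mask line becomes:

mask = color.codes[0] != color.codes[1]   # .labels on older pandas versions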
The idea behind alt is explained in my original answer, here.
This is a bit long-winded, but gets you the output you want:
new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
         .reset_index()
         .drop(['colour', 'points'], 1)
         .merge(df.set_index('colour'), left_on='level_0', right_index=True)
         .merge(df.set_index('colour'), left_on='level_1', right_index=True))
new['points_y'] *= -1
new['sum'] = new.sum(axis=1)
new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x),1)
>>> new
level_0 level_1 sum colours
3 yellow red 9 yellow_red
6 black red -4 black_red
1 red yellow -9 red_yellow
7 black yellow -13 black_yellow
2 red black 4 red_black
5 yellow black 13 yellow_black
import itertools

d = df.set_index('colour').to_dict()['points']
s = pd.Series(list(itertools.combinations(df.colour, 2)))
s = pd.concat([s, s.transform(lambda k: k[::-1])])
v = s.map(lambda k: d[k[0]] - d[k[1]])
df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})
comb values
0 red_yellow -9
1 red_black 4
2 yellow_black 13
0 yellow_red 9
1 black_red -4
2 black_yellow -13
You have to change this line in your code
new_points = point1 + point2
to this
new_points = point1 - point2
I'm trying to create a pandas dataframe like this:
x2 x3
0 3.536220 0.681269
1 0.681269 3.536220
2 -0.402380 2.303833
3 2.303833 -0.402380
4 2.032329 3.334412
5 3.334412 2.032329
6 0.371338 5.879732
. . .
So x2 is a column of random numbers, and x3 has the values of row 0 and 1 in x2 swapped, the values of 2 and 3 swapped, and so on. My current code is like this:
import numpy as np
import pandas as pd
x2 = pd.Series(np.random.normal(loc = 2, scale = 2.5, size = 1000))
x3 = pd.Series([x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)])
df = pd.DataFrame({'x2': x2, 'x3': x3})
I'm wondering if there is any faster or more elegant way, particularly if I want to have many rows (e.g. 1 million?) or do this over and over again (e.g. Monte Carlo simulation)?
Instead of
[x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]
you could use
def swap(arr):
    result = np.empty_like(arr)
    result[::2] = arr[1::2]    # even positions take the following odd values
    result[1::2] = arr[::2]    # odd positions take the preceding even values
    return result
For a sequence of length 1000, using swap is over 3000x faster:
In [84]: %timeit [x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]
100 loops, best of 3: 12.7 ms per loop
In [98]: %timeit swap(x2.values)
100000 loops, best of 3: 3.82 µs per loop
import numpy as np
import pandas as pd

np.random.seed(2017)
x2 = pd.Series(np.random.normal(loc = 2, scale = 2.5, size = 1000))
x3 = [x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]

def swap(arr):
    result = np.empty_like(arr)
    result[::2] = arr[1::2]
    result[1::2] = arr[::2]
    return result

df = pd.DataFrame({'x2': x2, 'x3': x3, 'x4': swap(x2.values)})
print(df.head())
prints
x2 x3 x4
0 -0.557363 1.649005 1.649005
1 1.649005 -0.557363 -0.557363
2 2.497731 3.433690 3.433690
3 3.433690 2.497731 2.497731
4 1.013555 0.679394 0.679394
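As an aside, the same pairwise swap can be expressed with a single reshape, assuming the array length is even (a sketch):

def swap_reshape(arr):
    # view the array as (n//2, 2) pairs, reverse each pair, then flatten back
    return arr.reshape(-1, 2)[:, ::-1].ravel()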
consider the array x and delta variable d
np.random.seed([3,1415])
x = np.random.randint(100, size=10)
d = 10
For each element in x, I want to count how many elements of x lie within distance d of it (each element counts itself, since its distance to itself is 0).
So x looks like
print(x)
[11 98 74 90 15 55 13 11 13 26]
The results should be
[5 2 1 2 5 1 5 5 5 1]
what I've tried
Strategy:
Use broadcasting to take the outer difference
Absolute value of outer difference
sum how many fall within the threshold
(np.abs(x[:, None] - x) <= d).sum(-1)
[5 2 1 2 5 1 5 5 5 1]
This works great. However, it doesn't scale: that outer difference is O(n^2) in time and memory. How can I get the same result without quadratic scaling?
Listed in this post are two more variants based on the searchsorted strategy from OP's answer post.
def pir3(a, d):  # short & less efficient
    sidx = a.argsort()
    p1 = a.searchsorted(a + d, 'right', sorter=sidx)
    p2 = a.searchsorted(a - d, sorter=sidx)
    return p1 - p2

def pir4(a, d):  # long & more efficient
    s = a.argsort()
    y = np.empty(s.size, dtype=np.int64)
    y[s] = np.arange(s.size)    # inverse of the sorting permutation
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[y]
The more efficient approach borrows the trick for computing the inverse of an argsort permutation from this post.
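In isolation, the trick builds the inverse permutation in O(n) with a single scatter instead of a second argsort (a sketch):

import numpy as np

a = np.array([30, 10, 20])
s = a.argsort()               # [1, 2, 0]
inv = np.empty_like(s)
inv[s] = np.arange(s.size)    # inv == s.argsort() == [2, 0, 1], without a second sort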
Runtime test -
In [155]: # Inputs
...: a = np.random.randint(0,1000000,(10000))
...: d = 10
In [156]: %timeit pir2(a,d) ## piRSquared's post solution
...: %timeit pir3(a,d)
...: %timeit pir4(a,d)
...:
100 loops, best of 3: 2.43 ms per loop
100 loops, best of 3: 4.44 ms per loop
1000 loops, best of 3: 1.66 ms per loop
Strategy
Since x is not necessarily sorted, we'll sort it and track the sorting permutation via argsort so we can reverse the permutation.
We'll use np.searchsorted on the sorted x with x - d to find the first position where values reach x - d
Do the same on the other side, this time with x + d and the np.searchsorted parameter side='right'
Take the difference between the right and left search results to get the number of elements within +/- d of each element
Use argsort to reverse the sorting permutation
Define the method presented in the question as pir1:

def pir1(a, d):
    return (np.abs(a[:, None] - a) <= d).sum(-1)

We'll define a new function pir2:

def pir2(a, d):
    s = a.argsort()
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[s.argsort()]
demo
pir1(x, d)
[5 2 1 2 5 1 5 5 5 1]
pir2(x, d)
[5 2 1 2 5 1 5 5 5 1]
timing
pir2 is the clear winner!
code
functions
def pir1(a, d):
    return (np.abs(a[:, None] - a) <= d).sum(-1)

def pir2(a, d):
    s = a.argsort()
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[s.argsort()]

#######################
# From Divakar's post #
#######################
def pir3(a, d):  # short & less efficient
    sidx = a.argsort()
    p1 = a.searchsorted(a + d, 'right', sorter=sidx)
    p2 = a.searchsorted(a - d, sorter=sidx)
    return p1 - p2

def pir4(a, d):  # long & more efficient
    s = a.argsort()
    y = np.empty(s.size, dtype=np.int64)
    y[s] = np.arange(s.size)
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[y]
test
from timeit import timeit

results = pd.DataFrame(
    index=np.arange(1, 50),
    columns=['pir%s' % i for i in range(1, 5)])

for i in results.index:
    np.random.seed([3, 1415])
    x = np.random.randint(1000000, size=i)
    for j in results.columns:
        setup = 'from __main__ import x, {}'.format(j)
        results.loc[i, j] = timeit('{}(x, 10)'.format(j), setup=setup, number=10000)

results.plot()
extended out to larger arrays
got rid of pir1
from timeit import timeit

results = pd.DataFrame(
    index=np.arange(1, 11) * 1000,
    columns=['pir%s' % i for i in range(2, 5)])

for i in results.index:
    np.random.seed([3, 1415])
    x = np.random.randint(1000000, size=i)
    for j in results.columns:
        setup = 'from __main__ import x, {}'.format(j)
        results.loc[i, j] = timeit('{}(x, 10)'.format(j), setup=setup, number=100)

results.insert(0, 'pir1', 0)
results.plot()
I have a 2-d dictionary in the following format:
myDict = {('a','b'):10, ('a','c'):20, ('a','d'):30, ('b','c'):40, ('b','d'):50,('c','d'):60}
How can I write this into a tab-delimited file so that the file contains the following? Filling in the value for a tuple (x, y) should fill both locations (x, y) and (y, x); (x, x) is always 0.
The output would be :
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
PS: If the dictionary can somehow be converted into a dataframe (using pandas), then it can easily be written to a file with a pandas function.
You can do this with the lesser-known align method and a little unstack magic:
In [122]: s = Series(myDict, index=MultiIndex.from_tuples(myDict))
In [123]: df = s.unstack()
In [124]: lhs, rhs = df.align(df.T)
In [125]: res = lhs.add(rhs, fill_value=0).fillna(0)
In [126]: res
Out[126]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Finally, to write this to a CSV file, use the to_csv method:
In [128]: res.to_csv('res.csv', sep='\t')
In [129]: !cat res.csv
a b c d
a 0.0 10.0 20.0 30.0
b 10.0 0.0 40.0 50.0
c 20.0 40.0 0.0 60.0
d 30.0 50.0 60.0 0.0
If you want to keep things as integers, cast using DataFrame.astype(), like so:
In [137]: res.astype(int).to_csv('res.csv', sep='\t')
In [138]: !cat res.csv
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
(It was cast to float because of the intermediate step of filling in nan values where indices from one frame were missing from the other)
@Dan Allan's answer using combine_first is nice:
In [130]: df.combine_first(df.T).fillna(0)
Out[130]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
Timings:
In [134]: timeit df.combine_first(df.T).fillna(0)
100 loops, best of 3: 2.01 ms per loop
In [135]: timeit lhs, rhs = df.align(df.T); res = lhs.add(rhs, fill_value=0).fillna(0)
1000 loops, best of 3: 1.27 ms per loop
Those timings are probably a bit polluted by construction costs, so what do things look like with some really huge frames?
In [143]: df = DataFrame({i: randn(1e7) for i in range(1, 11)})
In [144]: df2 = DataFrame({i: randn(1e7) for i in range(10)})
In [145]: timeit lhs, rhs = df.align(df2); res = lhs.add(rhs, fill_value=0).fillna(0)
1 loops, best of 3: 4.41 s per loop
In [146]: timeit df.combine_first(df2).fillna(0)
1 loops, best of 3: 2.95 s per loop
DataFrame.combine_first() is faster for larger frames.
In [49]: data = list(map(list, zip(*myDict.keys()))) + [list(myDict.values())]
In [50]: df = DataFrame(list(zip(*data))).set_index([0, 1])[2].unstack()
In [52]: df.combine_first(df.T).fillna(0)
Out[52]:
a b c d
a 0 10 20 30
b 10 0 40 50
c 20 40 0 60
d 30 50 60 0
For posterity: If you are just tuning in, check out Phillip Cloud's answer below for a neater way to construct df.
Not as elegant as I'd like (and not using pandas) but until you find something better:
adj = dict()
for ((u, v), w) in myDict.items():
    if u not in adj: adj[u] = dict()
    if v not in adj: adj[v] = dict()
    adj[u][v] = adj[v][u] = w

keys = list(adj.keys())
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        try:
            return str(adj[u][v])
        except KeyError:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
or equivalently (if you don't want to construct the adjacency matrix):
k = dict()
for ((u, v), w) in myDict.items():
    k[u] = k[v] = True

keys = list(k.keys())
print('\t' + '\t'.join(keys))
for u in keys:
    def f(v):
        if (u, v) in myDict:
            return str(myDict[(u, v)])
        elif (v, u) in myDict:
            return str(myDict[(v, u)])
        else:
            return "0"
    print(u + '\t' + '\t'.join(f(v) for v in keys))
Got it working using the pandas package.
from pandas import DataFrame

# Find all column names
z = []
for key in myDict.keys():
    z.extend(key)
colnames = sorted(set(z))
# Create an empty DataFrame using pandas
myDF = DataFrame(index=colnames, columns=colnames)
myDF = myDF.fillna(0)  # initialize with zeros

# Fill each item one by one (both symmetric positions)
for val in myDict:
    myDF.loc[val[1], val[0]] = myDict[val]   # .loc avoids chained assignment
    myDF.loc[val[0], val[1]] = myDict[val]

# Write to a file
outfilename = "matrixCooccurence.txt"
myDF.to_csv(outfilename, sep="\t", index=True, header=True, index_label="features")