top k columns with values in pandas dataframe for every row - python

I have a pandas dataframe like the following:
A B C D
0 7 2 5 2
1 3 3 1 1
2 0 2 6 1
3 3 6 2 9
There can be hundreds of columns; the example above shows only 4.
I would like to extract top-k columns for each row and their values.
I can get the top-k columns using:
pd.DataFrame({n: df.T[column].nlargest(k).index.tolist() for n, column in enumerate(df.T)}).T
which, for k=3 gives:
0 1 2
0 A C B
1 A B C
2 C B D
3 D B A
But what I would like to have is:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Is there a pand(a)oic way to achieve this?

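You can use a numpy solution:
use numpy.argsort to get the column positions in descending order of value, and from them the column names
the order is already given by argsort (thanks Jeff), so only the values need to be selected by those indices
interweave the names and values into a new array
pass the result to the DataFrame constructor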
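k = 3
vals = df.values
# column positions per row, ordered by descending value
arr1 = np.argsort(-vals, axis=1)
# names of the top-k columns (index the underlying array, not the Index itself)
a = df.columns.values[arr1[:,:k]]
# matching values, picked row by row
b = vals[np.arange(len(df.index))[:,None], arr1][:,:k]
# object array so that names and numbers can be interleaved
c = np.empty((vals.shape[0], 2 * k), dtype=object)
c[:,0::2] = a
c[:,1::2] = b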
print (c)
[['A' 7 'C' 5 'B' 2]
['A' 3 'B' 3 'C' 1]
['C' 6 'B' 2 'D' 1]
['D' 9 'B' 6 'A' 3]]
df = pd.DataFrame(c)
print (df)
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
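The steps above can be wrapped into a small function. This is only a sketch: the name top_k_pairs is made up for illustration, and the stable sort is my own choice so that ties keep the original column order (the answer above does not specify tie handling).

```python
import numpy as np
import pandas as pd

def top_k_pairs(df, k):
    """Hypothetical helper: interleave the names and values of the k largest entries per row."""
    vals = df.to_numpy()
    # Stable sort on the negated values: descending order, ties keep column order.
    order = np.argsort(-vals, axis=1, kind="stable")[:, :k]
    names = df.columns.to_numpy()[order]
    top = np.take_along_axis(vals, order, axis=1)
    # Object array so that strings and numbers can share one array.
    out = np.empty((len(df), 2 * k), dtype=object)
    out[:, 0::2] = names
    out[:, 1::2] = top
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({"A": [7, 3, 0, 3], "B": [2, 3, 2, 6],
                   "C": [5, 1, 6, 2], "D": [2, 1, 1, 9]})
result = top_k_pairs(df, 3)
print(result)
```

Even-numbered result columns hold the column names and odd-numbered ones hold the values, as in the desired output of the question.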

>>> def foo(x):
...     r = []
...     for p in zip(list(x.index), list(x)):
...         r.extend(p)
...     return r
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Or, using list comprehension:
>>> def foo(x):
...     return [j for i in zip(list(x.index), list(x)) for j in i]
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3

This does the job efficiently: it uses argpartition, which finds the k largest values of each row in time linear in the number of columns, and then sorts only those k values.
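values = df.values
n, m = df.shape
k = 4
# row and column index grids; I keeps a single column so it broadcasts against J
I, J = np.mgrid[:n, :m]
I = I[:, :1]
# unordered positions of the k largest values in each row
if k < m:
    J = (-values).argpartition(k)[:, :k]
values = values[I, J]
names = np.take(df.columns.values, J)
# sort only the k selected values, in descending order
J2 = (-values).argsort()
names = names[I, J2]
values = values[I, J2]
# interleave names and values
names_and_values = np.empty((n, 2*k), object)
names_and_values[:, 0::2] = names
names_and_values[:, 1::2] = values
result = pd.DataFrame(names_and_values)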
For k=3, the result is:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 B 3 A 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
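For reference, here is a self-contained sketch of the same partition-then-sort idea written with np.take_along_axis instead of index grids. The function name top_k_partition is hypothetical, and note that argpartition makes no promise about the order of tied values, so the column names reported for ties can differ from a full argsort.

```python
import numpy as np
import pandas as pd

def top_k_partition(df, k):
    """Hypothetical helper: top-k names and values per row via argpartition."""
    vals = df.to_numpy()
    n, m = vals.shape
    k = min(k, m)
    if k < m:
        # The first k positions hold the k largest values, in no particular order.
        cand = np.argpartition(-vals, k - 1, axis=1)[:, :k]
    else:
        # Nothing to discard: every column is a candidate.
        cand = np.tile(np.arange(m), (n, 1))
    cand_vals = np.take_along_axis(vals, cand, axis=1)
    # Sort only the k candidates of each row, in descending order.
    order = np.argsort(-cand_vals, axis=1)
    cols = np.take_along_axis(cand, order, axis=1)
    top = np.take_along_axis(cand_vals, order, axis=1)
    out = np.empty((n, 2 * k), dtype=object)
    out[:, 0::2] = df.columns.to_numpy()[cols]
    out[:, 1::2] = top
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({"A": [7, 3, 0, 3], "B": [2, 3, 2, 6],
                   "C": [5, 1, 6, 2], "D": [2, 1, 1, 9]})
res = top_k_partition(df, 3)
print(res)
```

The payoff over a full argsort only shows up when there are many columns and k is small, which is the case the question describes.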

Related

Pandas all possible combinations of two columns

I have such a table:
A B C
1 1 2 3
2 5 6 7
I want to extract every combination of two columns as a separate data frame:
A B
1 1 2
2 5 6
A C
1 1 3
2 5 7
B C
1 2 3
2 6 7
How to accomplish that?
Use itertools.combinations in loop with DataFrame.loc:
from itertools import combinations
for c in combinations(df.columns, 2):
    print (df.loc[:, c])
A B
1 1 2
2 5 6
A C
1 1 3
2 5 7
B C
1 2 3
2 6 7
If you need a list of DataFrames:
L = [df.loc[:, c] for c in combinations(df.columns, 2)]
print (L)
[ A B
1 1 2
2 5 6, A C
1 1 3
2 5 7, B C
1 2 3
2 6 7]
Or a dict of DataFrames:
d = {"_".join(c): df.loc[:, c] for c in combinations(df.columns, 2)}
print (d)
{'A_B': A B
1 1 2
2 5 6, 'A_C': A C
1 1 3
2 5 7, 'B_C': B C
1 2 3
2 6 7}
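A self-contained version of the dict approach, rebuilt from the table in the question, might look like the sketch below. Converting each tuple to a list before selecting is my own choice here; it avoids any ambiguity with tuple keys, which pandas also uses for MultiIndex columns.

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({"A": [1, 5], "B": [2, 6], "C": [3, 7]}, index=[1, 2])

# One sub-frame per unordered pair of columns, keyed by the joined column names.
pairs = {"_".join(cols): df[list(cols)] for cols in combinations(df.columns, 2)}

for name, frame in pairs.items():
    print(name)
    print(frame)
```

combinations yields the pairs in the order of df.columns, so the keys come out as A_B, A_C, B_C.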

How to add occurrence of each entry to pandas data frame?

Let k be a pandas data frame with a column of letters and a column of integers:
>>> k = pd.DataFrame({
...     "a": numpy.random.choice([i for i in "abcde"], 10),
...     "b": numpy.random.choice(range(5), 10)
... })
>>> k
a b
0 a 1
1 c 2
2 e 1
3 b 3
4 c 2
5 d 2
6 e 2
7 c 3
8 b 0
9 a 3
Using value_counts(), the counts of the letters are found:
>>> counts = k["a"].value_counts()
>>> counts
c 3
e 2
b 2
a 2
d 1
Name: a, dtype: int64
How can each count be added to its respective row? It should result in
>>> k
a b count
0 a 1 2
1 c 2 3
2 e 1 2
[...]
9 a 3 2
Here's an alternative to using transform:
First, you can extract the value_counts() into a dataframe:
mycounts = k['a'].value_counts().rename_axis('a').reset_index(name = 'counts')
The step above is useful in many different scenarios (and good to know in general).
Then, a left-join will put the value counts into the original dataframe:
k = k.merge(mycounts, left_on = 'a', right_on = 'a', how = 'left')
You can try with transform
k['count']=k.groupby('a').a.transform('count')
k
Out[330]:
a b count
0 d 1 2
1 e 3 3
2 e 3 3
3 d 3 2
4 b 4 4
5 b 1 4
6 b 0 4
7 a 2 1
8 b 0 4
9 e 4 3
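To compare the approaches on fixed data, here is a small sketch that uses the ten rows printed in the question instead of random draws. The via_map line is a third variant not shown in the answers: it looks each letter up in the value_counts() Series directly.

```python
import pandas as pd

# The ten rows shown in the question, hard-coded so the result is reproducible.
k = pd.DataFrame({"a": list("acebcdecba"),
                  "b": [1, 2, 1, 3, 2, 2, 2, 3, 0, 3]})

# Group size broadcast back to every row of the group.
via_transform = k.groupby("a")["a"].transform("count")

# Same result by looking each letter up in the value_counts() Series.
via_map = k["a"].map(k["a"].value_counts())

k["count"] = via_transform
print(k)
```

Both variants keep the original row order and index, which the merge-based approach also does with how='left'.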

Python Pandas: Apply a value to groupby result

Having the following data frames:
d1 = pd.DataFrame({'A':[1,1,1,2,2,2,3,3,3]})
A C
0 1 'x'
1 1 'x'
2 1 'x'
3 2 'y'
4 2 'y'
5 2 'y'
6 3 'z'
7 3 'z'
8 3 'z'
d2 = pd.DataFrame({'B':['a','b','c']})
0 a
1 b
2 c
I would like to apply the values of d2 to the groups of A and C of d1 so the resulting DF would look like this:
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
How can I achieve this using Pandas?
If the values of A are consecutive integers starting at 1, you can use Series.map with an enumerate object converted to a dictionary:
d1['b'] = d1['A'].map(dict(enumerate(d2['B'], 1)))
print (d1)
A b
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
A general solution uses factorize to number the values from 0 and maps those numbers through a dictionary:
d = dict(zip(*pd.factorize(d2['B'])))
d1['B'] = pd.Series(pd.factorize(d1['A'])[0], index=d1.index).map(d)
#alternative
#d1['B'] = d1.groupby('A', sort=False).ngroup().map(d)
print (d1)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
To take duplicate categories in your d2 into account, we will use drop_duplicates with Series.map:
values = d2['B'].drop_duplicates()
values.index = values.index + 1
d1['B'] = d1['A'].map(values)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
You can use df.merge here.
d2.index+=1
d1.merge(d2,left_on='A',right_index=True)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
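Since the question asks about groups of both A and C, here is a sketch that numbers the (A, C) groups with ngroup and looks the numbers up in d2. It assumes d1 really has the C column shown in the question's table (the constructor in the question omits it) and that d2 has the default 0-based index with one row per group, in order of first appearance.

```python
import pandas as pd

# Assumption: d1 includes the C column displayed in the question.
d1 = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "C": list("xxxyyyzzz")})
d2 = pd.DataFrame({"B": ["a", "b", "c"]})

# Number each (A, C) group 0, 1, 2, ... in order of first appearance.
group_no = d1.groupby(["A", "C"], sort=False).ngroup()

# Use the group number as a lookup key into d2's default integer index.
d1["B"] = group_no.map(d2["B"])
print(d1)
```

If d2 has fewer rows than there are groups, the unmatched rows end up as NaN rather than raising an error.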

Pandas - combine two Series by all unique combinations

Let's say I have the following series:
0 A
1 B
2 C
dtype: object
0 1
1 2
2 3
3 4
dtype: int64
How can I merge them to create an empty dataframe with every possible combination of values, like this:
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
Assuming the two series are s and s1, use itertools.product(), which gives the Cartesian product of the input iterables:
import itertools
df = pd.DataFrame(list(itertools.product(s,s1)),columns=['letter','number'])
print(df)
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As of Pandas 1.2.0, there is a how='cross' option in pandas.merge() that produces the Cartesian product of the columns.
import pandas as pd
letters = pd.DataFrame({'letter': ['A','B','C']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As an additional bonus, this function makes it easy to do so with more than one column.
letters = pd.DataFrame({'letterA': ['A','B','C'],
'letterB': ['D','D','E']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letterA letterB number
0 A D 1
1 A D 2
2 A D 3
3 A D 4
4 B D 1
5 B D 2
6 B D 3
7 B D 4
8 C E 1
9 C E 2
10 C E 3
11 C E 4
If you have two Series named s1 and s2 (the column selection at the end relies on those names), you can do this:
pd.DataFrame(index=s1,columns=s2).unstack().reset_index()[["s1","s2"]]
It will give you the following:
s1 s2
0 A 1
1 B 1
2 C 1
3 A 2
4 B 2
5 C 2
6 A 3
7 B 3
8 C 3
9 A 4
10 B 4
11 C 4
You can use pandas.MultiIndex.from_product():
import pandas as pd
pd.DataFrame(
index = pd.MultiIndex
.from_product(
[
['A', 'B', 'C'],
[1, 2, 3, 4]
],
names = ['letters', 'numbers']
)
)
which results in a hierarchical structure:
letters numbers
A 1
2
3
4
B 1
2
3
4
C 1
2
3
4
and you can further call .reset_index() to get ungrouped results:
letters numbers
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
(However I find @NickCHK's answer to be the best)
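To check that these approaches agree, here is a sketch that builds the product three ways from the same two Series. The Series names letter and number are my own choice, and how="cross" needs pandas 1.2.0 or later, as noted above.

```python
import itertools
import pandas as pd

s = pd.Series(["A", "B", "C"], name="letter")
s1 = pd.Series([1, 2, 3, 4], name="number")

# 1. Plain Python: Cartesian product of the values.
via_product = pd.DataFrame(list(itertools.product(s, s1)),
                           columns=["letter", "number"])

# 2. MultiIndex.from_product, flattened straight into columns.
via_index = pd.MultiIndex.from_product(
    [s, s1], names=["letter", "number"]
).to_frame(index=False)

# 3. Cross merge of two single-column frames.
via_merge = s.to_frame().merge(s1.to_frame(), how="cross")

print(via_merge)
```

All three give the rows in the same order, with the first Series varying slowest.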

Convert an nxn matrix to a pandas dataframe

I have n-by-n data in a CSV file in the following format
- A B C D
A 0 1 2 4
B 2 0 3 1
C 1 0 0 5
D 2 5 4 0
...
I would like to read it and convert it to a three-column pandas dataframe in the following format:
Origin Dest Distance
A A 0
A B 1
A C 2
...
What is the best way to convert it? In the worst case, I'll write a for loop to read each line and append the transpose of it but there must be an easier way. Any help would be appreciated.
Use pd.melt()
Assuming your dataframe looks like
In [479]: df
Out[479]:
- A B C D
0 A 0 1 2 4
1 B 2 0 3 1
2 C 1 0 0 5
3 D 2 5 4 0
In [480]: pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
.....: var_name='Dest', value_name='Distance')
Out[480]:
- Dest Distance
0 A A 0
1 B A 2
2 C A 1
3 D A 2
4 A B 1
5 B B 0
6 C B 0
7 D B 5
8 A C 2
9 B C 3
10 C C 0
11 D C 4
12 A D 4
13 B D 1
14 C D 5
15 D D 0
Here df.columns.values.tolist()[1:] gives the remaining columns ['A', 'B', 'C', 'D'].
To replace '-' with 'Origin', you could use DataFrame.rename(columns={...}):
pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
var_name='Dest', value_name='Distance').rename(columns={'-': 'Origin'})
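Note that melt returns the rows grouped by Dest, while the question lists them grouped by Origin. If that order matters, one alternative is to read the labels as the index and use stack. The sketch below assumes the file is comma-separated with the row labels in the first column; the inline csv_text stands in for the real file.

```python
import io
import pandas as pd

# Assumption: comma-separated file, first column holds the row labels.
csv_text = """-,A,B,C,D
A,0,1,2,4
B,2,0,3,1
C,1,0,0,5
D,2,5,4,0
"""
matrix = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Name both axes, then stack the columns into a second index level.
long_form = (
    matrix.rename_axis(index="Origin", columns="Dest")
    .stack()
    .rename("Distance")
    .reset_index()
)
print(long_form.head())
```

stack walks the matrix row by row, so every Origin is followed by all of its destinations.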
