I have the following data frames:
d1 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'C': ['x', 'x', 'x', 'y', 'y', 'y', 'z', 'z', 'z']})
A C
0 1 x
1 1 x
2 1 x
3 2 y
4 2 y
5 2 y
6 3 z
7 3 z
8 3 z
d2 = pd.DataFrame({'B': ['a', 'b', 'c']})
B
0 a
1 b
2 c
I would like to map the values of d2 onto the groups of A and C in d1, so the resulting DataFrame would look like this:
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
How can I achieve this using Pandas?
If the values in A are consecutive integers starting at 1, you can use Series.map with an enumerate object converted to a dictionary:
d1['B'] = d1['A'].map(dict(enumerate(d2['B'], 1)))
print (d1)
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
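To see why this only works when A holds consecutive integers starting at 1, inspect the dictionary that enumerate builds (a quick sketch; the name mapping is used only for illustration):
mapping = dict(enumerate(d2['B'], 1))
# mapping -> {1: 'a', 2: 'b', 3: 'c'}
# the positions of d2['B'] are shifted to start at 1 and matched against the values of A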
A more general solution uses factorize, which numbers the groups starting at 0, and maps those numbers through a dictionary:
d = dict(zip(*pd.factorize(d2['B'])))
d1['B'] = pd.Series(pd.factorize(d1['A'])[0], index=d1.index).map(d)
#alternative
#d1['B'] = d1.groupby('A', sort=False).ngroup().map(d)
print (d1)
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
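For reference, the intermediate objects of the factorize approach look roughly like this (a minimal sketch with the same d2 as above):
codes, uniques = pd.factorize(d2['B'])
# codes   -> array([0, 1, 2])
# uniques -> Index(['a', 'b', 'c'], dtype='object')
d = dict(zip(codes, uniques))
# d -> {0: 'a', 1: 'b', 2: 'c'}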
To account for possible duplicate categories in your d2, we can use drop_duplicates with Series.map:
values = d2['B'].drop_duplicates()
values.index = values.index + 1
d1['B'] = d1['A'].map(values)
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
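The intermediate values Series is what drives the mapping; after shifting the index it looks like this:
print (values)
1    a
2    b
3    c
Name: B, dtype: object
Its index (1, 2, 3) matches the values of A, so Series.map can look up each row directly.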
You can use df.merge here (note that this shifts d2's index in place):
d2.index += 1
d1.merge(d2, left_on='A', right_index=True)
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
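If you would rather not modify d2 in place, skip the index shift above and merge against a shifted copy instead (a sketch, not from the original answer):
d1.merge(d2.set_index(d2.index + 1), left_on='A', right_index=True)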
I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to make an incrementing ID for each unique group (grouped by columns a and b). So the result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately this works only on the grouped result and not on the original dataset. As you can see, the original table has 7 rows, while the groupby result returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
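Note that ngroup numbers the groups in sorted key order by default, which happens to coincide with first-appearance order in this example. If you need numbering strictly by order of first appearance, sort=False should do it (a small variation, not from the original answer):
df['c'] = df.groupby(['a', 'b'], sort=False).ngroup() + 1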
Use pd.factorize after creating a tuple from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
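If the row-wise apply is too slow on a large frame, an equivalent sketch that builds the tuples with zip instead should give the same result (an assumed variant, worth verifying):
df['c'] = pd.factorize(list(zip(df['a'], df['b'])))[0] + 1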
I have a table like this:
A B C
1 1 2 3
2 5 6 7
I want to extract all combination with two columns as separate data frame:
A B
1 1 2
2 5 6
A C
1 1 3
2 5 7
B C
1 2 3
2 6 7
How can I accomplish that?
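Assuming the example table is constructed like this (its construction is not shown in the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 5], 'B': [2, 6], 'C': [3, 7]}, index=[1, 2])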
Use itertools.combinations in a loop with DataFrame.loc:
from itertools import combinations
for c in combinations(df.columns, 2):
    print (df.loc[:, list(c)])
A B
1 1 2
2 5 6
A C
1 1 3
2 5 7
B C
1 2 3
2 6 7
If you need a list of DataFrames:
L = [df.loc[:, list(c)] for c in combinations(df.columns, 2)]
print (L)
[ A B
1 1 2
2 5 6, A C
1 1 3
2 5 7, B C
1 2 3
2 6 7]
Or a dict of DataFrames:
d = {'_'.join(c): df.loc[:, list(c)] for c in combinations(df.columns, 2)}
print (d)
{'A_B': A B
1 1 2
2 5 6, 'A_C': A C
1 1 3
2 5 7, 'B_C': B C
1 2 3
2 6 7}
Let's say I have the following series:
0 A
1 B
2 C
dtype: object
0 1
1 2
2 3
3 4
dtype: int64
How can I combine them to create a dataframe with every possible combination of their values, like this:
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
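Assuming the two Series are constructed roughly like this (the answers below refer to them by names such as s and s1):
import pandas as pd

s = pd.Series(['A', 'B', 'C'])
s1 = pd.Series([1, 2, 3, 4])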
Assuming the two Series are s and s1, use itertools.product(), which gives the Cartesian product of the input iterables:
import itertools
df = pd.DataFrame(list(itertools.product(s, s1)), columns=['letter', 'number'])
print(df)
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As of Pandas 1.2.0, there is a how='cross' option in pandas.merge() that produces the Cartesian product of the columns.
import pandas as pd
letters = pd.DataFrame({'letter': ['A','B','C']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letter number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
As an additional bonus, this function makes it easy to do so with more than one column.
letters = pd.DataFrame({'letterA': ['A','B','C'],
'letterB': ['D','D','E']})
numbers = pd.DataFrame({'number': [1,2,3,4]})
together = pd.merge(letters, numbers, how = 'cross')
letterA letterB number
0 A D 1
1 A D 2
2 A D 3
3 A D 4
4 B D 1
5 B D 2
6 B D 3
7 B D 4
8 C E 1
9 C E 2
10 C E 3
11 C E 4
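If you start from the two Series rather than DataFrames, the same cross merge should work after converting them with to_frame (a small sketch, assuming the s and s1 names from the first answer):
together = s.to_frame('letter').merge(s1.to_frame('number'), how='cross')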
If you have two Series s1 and s2 (here s1 holds the letters and s2 the numbers; they must be named 's1' and 's2', because those names are used to select the columns after reset_index), you can do this:
pd.DataFrame(index=s1, columns=s2).unstack().reset_index()[["s1", "s2"]]
It will give you the following:
s1 s2
0 A 1
1 B 1
2 C 1
3 A 2
4 B 2
5 C 2
6 A 3
7 B 3
8 C 3
9 A 4
10 B 4
11 C 4
You can use pandas.MultiIndex.from_product():
import pandas as pd

pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C'], [1, 2, 3, 4]],
        names=['letters', 'numbers']
    )
)
which results in a hierarchical structure:
letters numbers
A 1
2
3
4
B 1
2
3
4
C 1
2
3
4
and you can further call .reset_index() to get ungrouped results:
letters numbers
0 A 1
1 A 2
2 A 3
3 A 4
4 B 1
5 B 2
6 B 3
7 B 4
8 C 1
9 C 2
10 C 3
11 C 4
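Concretely, the chained version (same lists as above, with .reset_index() appended) would be:
out = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C'], [1, 2, 3, 4]],
        names=['letters', 'numbers']
    )
).reset_index()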
(However, I find NickCHK's answer to be the best.)
I have a pandas dataframe like the following:
A B C D
0 7 2 5 2
1 3 3 1 1
2 0 2 6 1
3 3 6 2 9
There can be 100s of columns, in the above example I have only shown 4.
I would like to extract top-k columns for each row and their values.
I can get the top-k columns using:
pd.DataFrame({n: df.T[column].nlargest(k).index.tolist() for n, column in enumerate(df.T)}).T
which, for k=3 gives:
0 1 2
0 A C B
1 A B C
2 C B D
3 D B A
But what I would like to have is:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Is there a pand(a)oic way to achieve this?
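Assuming the example frame is constructed like this (its construction is not shown in the question):
import pandas as pd

df = pd.DataFrame({'A': [7, 3, 0, 3], 'B': [2, 3, 2, 6],
                   'C': [5, 1, 6, 2], 'D': [2, 1, 1, 9]})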
You can use a NumPy solution:
- numpy.argsort to order the columns of each row by value
- the argsort result is already sorted (thanks Jeff), so just take the values by those indices
- interweave the names and values into a new array
- pass the result to the DataFrame constructor
import numpy as np

k = 3
vals = df.values
arr1 = np.argsort(-vals, axis=1)                          # column order per row, largest first
a = df.columns.to_numpy()[arr1[:, :k]]                    # top-k column names per row
b = vals[np.arange(len(df.index))[:, None], arr1][:, :k]  # top-k values per row
c = np.empty((vals.shape[0], 2 * k), dtype=a.dtype)
c[:, 0::2] = a                                            # names in the even slots
c[:, 1::2] = b                                            # values in the odd slots
print (c)
[['A' 7 'C' 5 'B' 2]
['A' 3 'B' 3 'C' 1]
['C' 6 'B' 2 'D' 1]
['D' 9 'B' 6 'A' 3]]
df = pd.DataFrame(c)
print (df)
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
>>> def foo(x):
... r = []
... for p in zip(list(x.index), list(x)):
... r.extend(p)
... return r
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Or, using a list comprehension:
>>> def foo(x):
... return [j for i in zip(list(x.index), list(x)) for j in i]
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
This does the job efficiently: it uses argpartition, which finds the k largest entries of each row in linear time, and then sorts only those.
import numpy as np

values = df.values
n, m = df.shape
k = 3
I, J = np.mgrid[:n, :m]
I = I[:, :1]
if k < m: J = (-values).argpartition(k)[:, :k]
values = values[I, J]
names = np.take(df.columns.to_numpy(), J)
J2 = (-values).argsort()
names = names[I, J2]
values = values[I, J2]
names_and_values = np.empty((n, 2 * k), object)
names_and_values[:, 0::2] = names
names_and_values[:, 1::2] = values
result = pd.DataFrame(names_and_values)
For k = 3 this gives:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 B 3 A 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
I want to create a new column containing the name of the column that holds the max value in each row. A tie should include both column names.
A B C D
TRDNumber
ALB2008081610 3 1 1 1
ALB200808167 1 3 4 1
ALB200808168 3 1 3 1
ALB200808171 2 2 5 1
ALB2008081710 1 2 2 5
Desired output
A B C D Best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D
I have tried the following code:
df.groupby(['TRDNumber'])[cols].max()
You can do:
>>> f = lambda r: ','.join(df.columns[r])
>>> df.eq(df.max(axis=1), axis=0).apply(f, axis=1)
TRDNumber
ALB2008081610 A
ALB200808167 C
ALB200808168 A,C
ALB200808171 C
ALB2008081710 D
dtype: object
>>> df['best'] = _
>>> df
A B C D best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D
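As a side note, the same boolean mask can be collapsed into the new column in a single assignment, without relying on the interactive _, via a string dot product (a sketch; worth verifying on your pandas version):
mask = df.eq(df.max(axis=1), axis=0)                      # True where a cell equals its row maximum
df['best'] = mask.dot(df.columns + ',').str.rstrip(',')   # concatenate the matching column names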