Pandas generate numeric sequence for groups in new column - python

I am working on a data frame as below,
import pandas as pd
df=pd.DataFrame({'A':['A','A','A','B','B','C','C','C','C'],
'B':['a','a','b','a','b','a','b','c','c'],
})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B b
5 C a
6 C b
7 C c
8 C c
I want to create a new column that numbers the column B subgroups sequentially within each column A group, like below:
A B C
0 A a 1
1 A a 1
2 A b 2
3 B a 1
4 B b 2
5 C a 1
6 C b 2
7 C c 3
8 C c 3
I tried this, but it does not give me the desired output:
df['C'] = df.groupby(['A','B']).cumcount()+1

IIUC, I think you want something like this:
df['C'] = df.groupby('A')['B'].transform(lambda x: (x != x.shift()).cumsum())
Output:
A B C
0 A a 1
1 A a 1
2 A b 2
3 B a 1
4 B b 2
5 C a 1
6 C b 2
7 C c 3
8 C c 3
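One caveat: the shift/cumsum approach numbers runs of consecutive equal values, so a B value that reappears later in the same A group would get a new number. If the same B value should always get the same number within its group, a per-group factorize is one option. A minimal sketch, assuming the sample frame from the question (the column names C_runs and C_codes are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'B': ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'c', 'c'],
})

# Number runs of consecutive equal B values inside each A group.
df['C_runs'] = df.groupby('A')['B'].transform(
    lambda s: (s != s.shift()).cumsum()
)

# Number distinct B values by order of first appearance inside each A group.
df['C_codes'] = df.groupby('A')['B'].transform(
    lambda s: pd.Series(pd.factorize(s)[0] + 1, index=s.index)
)

print(df)
```

On this data both columns agree, because equal B values are always adjacent within a group. They differ only when a value repeats after a gap.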

Related

How to groupby in different keys

I have a df like the following:
A B C
a a d
a b d
a b e
b c e
b c f
When I try
df.groupby('A').size()
A
a 3
b 2
df.groupby('B').size()
B
a 1
b 2
c 2
My desired result is a single aggregated table:
A B
a 3 1
b 2 2
c 0 2
Is there any way to achieve this result?
If anyone has a suggestion, please let me know.
Thanks
melt + crosstab
s = df[['A','B']].melt()
out = pd.crosstab(s['value'],s['variable'])
out
Out[18]:
variable A B
value
a 3 1
b 2 2
c 0 2
Or
df[['A','B']].apply(pd.Series.value_counts)
Out[19]:
A B
a 3.0 1
b 2.0 2
c NaN 2
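The value_counts version leaves NaN where a value never occurs in a column, which also turns that column into floats. To get the integer table from the question, the gaps can be filled with 0 and cast back. A small sketch, assuming the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'a', 'a', 'b', 'b'],
    'B': ['a', 'b', 'b', 'c', 'c'],
    'C': ['d', 'd', 'e', 'e', 'f'],
})

counts = (
    df[['A', 'B']]
    .apply(pd.Series.value_counts)  # one value_counts per column, aligned on the values
    .fillna(0)                      # a value missing from a column counts as 0
    .astype(int)                    # undo the float upcast caused by NaN
    .sort_index()
)
print(counts)
```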

Python Pandas: Apply a value to groupby result

Having the following data frames:
d1 = pd.DataFrame({'A':[1,1,1,2,2,2,3,3,3]})
A C
0 1 'x'
1 1 'x'
2 1 'x'
3 2 'y'
4 2 'y'
5 2 'y'
6 3 'z'
7 3 'z'
8 3 'z'
d2 = pd.DataFrame({'B':['a','b','c']})
0 a
1 b
2 c
I would like to apply the values of d2 to the groups of A and C of d1 so the resulting DF would look like this:
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
How can I achieve this using Pandas?
If possible, you can use Series.map with an enumerate object converted to a dictionary:
d1['b'] = d1['A'].map(dict(enumerate(d2['B'], 1)))
print (d1)
A b
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
A more general solution uses factorize, which numbers the groups starting from 0, and maps those codes through a dictionary:
d = dict(zip(*pd.factorize(d2['B'])))
d1['B'] = pd.Series(pd.factorize(d1['A'])[0], index=d1.index).map(d)
#alternative
#d1['B'] = d1.groupby('A', sort=False).ngroup().map(d)
print (d1)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
To take duplicate categories in your d2 into account, we will use drop_duplicates with Series.map:
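# reset the index so the remaining values are numbered 0, 1, 2, ... without gaps
values = d2['B'].drop_duplicates().reset_index(drop=True)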
values.index = values.index + 1
d1['B'] = d1['A'].map(values)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
You can use df.merge here.
d2.index+=1
d1.merge(d2,left_on='A',right_index=True)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
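Note that d2.index+=1 changes d2 in place, so running it twice shifts the index twice. A sketch of the same merge on a shifted copy, assuming the values in A are 1, 2, 3, ... in the same order as the rows of d2 (the name lookup is just illustrative):

```python
import pandas as pd

d1 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3]})
d2 = pd.DataFrame({'B': ['a', 'b', 'c']})

# Build a lookup table whose index starts at 1, leaving d2 itself untouched.
lookup = d2.set_index(d2.index + 1)

# A left join keeps every row of d1, even if some A has no match in lookup.
out = d1.merge(lookup, left_on='A', right_index=True, how='left')
print(out)
```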

Pandas: Restructure dataframe from column values

The pandas dataframe includes two columns 'A' and 'B'
A B
1 a b
2 a c d
3 x
Each value in column 'B' is a string containing a variable number of letters separated by spaces.
Is there a simple way to construct:
A B
1 a
1 b
2 a
2 c
2 d
3 x
You can use the following:
splitted = df.set_index("A")["B"].str.split(expand=True)
stacked = splitted.stack().reset_index(1, drop=True)
result = stacked.to_frame("B").reset_index()
print(result)
A B
0 1 a
1 1 b
2 2 a
3 2 c
4 2 d
5 3 x
For the sub steps, see below:
print(splitted)
0 1 2
A
1 a b None
2 a c d
3 x None None
print(stacked)
A
1 a
1 b
2 a
2 c
2 d
3 x
dtype: object
Or you may also use pd.melt:
splitted = df["B"].str.split(expand=True)
pd.melt(splitted.assign(A=df.A), id_vars="A", value_name="B")\
.dropna()\
.drop("variable", axis=1)\
.sort_values("A")
A B
0 1 a
3 1 b
1 2 a
4 2 c
7 2 d
2 3 x
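On pandas 0.25 or newer, DataFrame.explode should give the same result in fewer steps. A minimal sketch, assuming the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a b', 'a c d', 'x']})

result = (
    df.assign(B=df['B'].str.split())  # each cell becomes a list of tokens
      .explode('B')                   # one row per list element, A is repeated
      .reset_index(drop=True)
)
print(result)
```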

top k columns with values in pandas dataframe for every row

I have a pandas dataframe like the following:
A B C D
0 7 2 5 2
1 3 3 1 1
2 0 2 6 1
3 3 6 2 9
There can be hundreds of columns; the example above shows only 4.
I would like to extract top-k columns for each row and their values.
I can get the top-k columns using:
pd.DataFrame({n: df.T[column].nlargest(k).index.tolist() for n, column in enumerate(df.T)}).T
which, for k=3 gives:
0 1 2
0 A C B
1 A B C
2 C B D
3 D B A
But what I would like to have is:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Is there a pand(a)oic way to achieve this?
You can use a numpy solution:
numpy.argsort to get the column order for each row
the array is then already sorted (thanks Jeff), so only the values at those indices are needed
interweave names and values into a new array
DataFrame constructor
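import numpy as np
k = 3
vals = df.values
arr1 = np.argsort(-vals, axis=1)
# index a plain array of names; this avoids 2-D indexing on the Index object, which newer pandas versions reject
a = df.columns.to_numpy()[arr1[:,:k]]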
b = vals[np.arange(len(df.index))[:,None], arr1][:,:k]
c = np.empty((vals.shape[0], 2 * k), dtype=a.dtype)
c[:,0::2] = a
c[:,1::2] = b
print (c)
[['A' 7 'C' 5 'B' 2]
['A' 3 'B' 3 'C' 1]
['C' 6 'B' 2 'D' 1]
['D' 9 'B' 6 'A' 3]]
df = pd.DataFrame(c)
print (df)
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
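The same steps can be wrapped in a function that does not overwrite df. This is only a sketch: the function name top_k_interleaved is made up, it assumes all columns are numeric, and it uses a stable sort so that tied values keep their original column order.

```python
import numpy as np
import pandas as pd

def top_k_interleaved(df, k):
    """Return one row per input row: name, value, name, value, ... for the k largest."""
    vals = df.to_numpy()
    # Stable sort on the negated values: largest first, ties in column order.
    order = np.argsort(-vals, axis=1, kind='stable')[:, :k]
    names = df.columns.to_numpy()[order]
    top = np.take_along_axis(vals, order, axis=1)
    # Object array so that strings and numbers can sit side by side.
    out = np.empty((len(df), 2 * k), dtype=object)
    out[:, 0::2] = names
    out[:, 1::2] = top
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({'A': [7, 3, 0, 3], 'B': [2, 3, 2, 6],
                   'C': [5, 1, 6, 2], 'D': [2, 1, 1, 9]})
result = top_k_interleaved(df, 3)
print(result)
```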
>>> def foo(x):
...     r = []
...     for p in zip(list(x.index), list(x)):
...         r.extend(p)
...     return r
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
Or, using list comprehension:
>>> def foo(x):
...     return [j for i in zip(list(x.index), list(x)) for j in i]
...
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
0 1 2 3 4 5
0 A 7 C 5 B 2
1 A 3 B 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3
This does the job efficiently: it uses argpartition, which finds the k largest values of a row in linear time, and then sorts only those.
import numpy as np
values=df.values
n,m=df.shape
k=4
I,J=np.mgrid[:n,:m]
I=I[:,:1]
if k<m: J=(-values).argpartition(k)[:,:k]
values=values[I,J]
names=np.take(df.columns,J)
J2=(-values).argsort()
names=names[I,J2]
values=values[I,J2]
names_and_values=np.empty((n,2*k),object)
names_and_values[:,0::2]=names
names_and_values[:,1::2]=values
result=pd.DataFrame(names_and_values)
For k=3 this gives:
0 1 2 3 4 5
0 A 7 C 5 B 2
1 B 3 A 3 C 1
2 C 6 B 2 D 1
3 D 9 B 6 A 3

Convert an nxn matrix to a pandas dataframe

I have n-by-n data in a CSV file in the following format:
- A B C D
A 0 1 2 4
B 2 0 3 1
C 1 0 0 5
D 2 5 4 0
...
I would like to read it and convert it to a three-column (long-format) pandas dataframe in the following format:
Origin Dest Distance
A A 0
A B 1
A C 2
...
What is the best way to convert it? In the worst case, I'll write a for loop to read each line and append the transpose of it but there must be an easier way. Any help would be appreciated.
Use pd.melt()
Assuming your dataframe looks like this:
In [479]: df
Out[479]:
- A B C D
0 A 0 1 2 4
1 B 2 0 3 1
2 C 1 0 0 5
3 D 2 5 4 0
In [480]: pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
.....: var_name='Dest', value_name='Distance')
Out[480]:
- Dest Distance
0 A A 0
1 B A 2
2 C A 1
3 D A 2
4 A B 1
5 B B 0
6 C B 0
7 D B 5
8 A C 2
9 B C 3
10 C C 0
11 D C 4
12 A D 4
13 B D 1
14 C D 5
15 D D 0
Here df.columns.values.tolist()[1:] gives the remaining columns ['A', 'B', 'C', 'D'].
To replace '-' with 'Origin', you could use dataframe.rename(columns={...})
pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
var_name='Dest', value_name='Distance').rename(columns={'-': 'Origin'})
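If the rows should come out in the order shown in the question (A-A, A-B, A-C, ...), another option is to read the first column as the index and use stack. A sketch under the assumption that the file is comma-separated with '-' as the header of the label column; io.StringIO stands in for the real file here:

```python
import io
import pandas as pd

# Assumed file contents; replace io.StringIO(csv_text) with the real path.
csv_text = """-,A,B,C,D
A,0,1,2,4
B,2,0,3,1
C,1,0,0,5
D,2,5,4,0
"""

matrix = pd.read_csv(io.StringIO(csv_text), index_col=0)

long_df = (
    matrix.rename_axis(index='Origin', columns='Dest')  # name both axes
          .stack()                                       # (Origin, Dest) -> value
          .rename('Distance')
          .reset_index()
)
print(long_df.head())
```

stack walks the matrix row by row, so each origin is listed with all of its destinations before the next origin starts, whereas melt goes column by column.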
