I have dflike following
A B C
a a d
a b d
a b e
b c e
b c f
When I try
df.groupby(A).size()
A
a 3
b 2
df.groupby(B).size()
B
a 1
b 2
c 2
my desired result is aggregated one
A B
a 3 1
b 2 2
c 0 2
Are there any way to achieve this result ?
If someone has opinion,please let me know.
Thanks
melt + crosstab
s = df[['A','B']].melt()
out = pd.crosstab(s['value'],s['variable'])
out
Out[18]:
variable A B
value
a 3 1
b 2 2
c 0 2
Or
df[['A','B']].apply(pd.Series.value_counts)
Out[19]:
A B
a 3.0 1
b 2.0 2
c NaN 2
Related
I am working on a data frame as below,
import pandas as pd
df=pd.DataFrame({'A':['A','A','A','B','B','C','C','C','C'],
'B':['a','a','b','a','b','a','b','c','c'],
})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B b
5 C a
6 C b
7 C c
8 C c
I want to create a new column with the sequence value for Column B subgroups based on Column A groups like below
A B C
0 A a 1
1 A a 1
2 A b 2
3 B a 1
4 B b 2
5 C a 3
6 C b 1
7 C c 2
8 C c 2
I tried this , but does not give me desired output
df['C'] = df.groupby(['A','B']).cumcount()+1
IIUC, I think you want something like this:
df['C'] = df.groupby('A')['B'].transform(lambda x: (x != x.shift()).cumsum())
Output:
A B C
0 A a 1
1 A a 1
2 A b 2
3 B a 1
4 B b 2
5 C c 1
6 C b 2
7 C c 3
8 C c 3
Having the following data frames:
d1 = pd.DataFrame({'A':[1,1,1,2,2,2,3,3,3]})
A C
0 1 'x'
1 1 'x'
2 1 'x'
3 2 'y'
4 2 'y'
5 2 'y'
6 3 'z'
7 3 'z'
8 3 'z'
d2 = pd.DataFrame({'B':['a','b','c']})
0 a
1 b
2 c
I would like to apply the values of d2 to the groups of A and C of d1 so the resulting DF would look like this:
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
How can I achieve this using Pandas?
If possible you can use Series.map with enumerate object converted to dictionary:
d1['b'] = d1['A'].map(dict(enumerate(d2['B'], 1)))
print (d1)
A b
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
General solutions with factorize for numeric values started by 0 and mapped to dictionary:
d = dict(zip(*pd.factorize(d2['B'])))
d1['B'] = pd.Series(pd.factorize(d1['A'])[0], index=d1.index).map(d)
#alternative
#d1['B'] = d1.groupby('A', sort=False).ngroup().map(d)
print (d1)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
To take duplicate categories in your d2 into account, we will use drop_duplicates with Series.map:
values = d2['B'].drop_duplicates()
values.index = values.index + 1
d1['B'] = d1['A'].map(values)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
You can use df.merge here.
d2.index+=1
d1.merge(d2,left_on='A',right_index=True)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
The pandas dataframe includes two columns 'A' and 'B'
A B
1 a b
2 a c d
3 x
Each value in column 'B' is a string containing a variable number of letters separated by spaces.
Is there a simple way to construct:
A B
1 a
1 b
2 a
2 c
2 d
3 x
You can use the following:
splitted = df.set_index("A")["B"].str.split(expand=True)
stacked = splitted.stack().reset_index(1, drop=True)
result = stacked.to_frame("B").reset_index()
print(result)
A B
0 1 a
1 1 b
2 2 a
3 2 c
4 2 d
5 3 x
For the sub steps, see below:
print(splitted)
0 1 2
A
1 a b None
2 a c d
3 x None None
print(stacked)
A
1 a
1 b
2 a
2 c
2 d
3 x
dtype: object
Or you may also use pd.melt:
splitted = df["B"].str.split(expand=True)
pd.melt(splitted.assign(A=df.A), id_vars="A", value_name="B")\
.dropna()\
.drop("variable", axis=1)\
.sort_values("A")
A B
0 1 a
3 1 b
1 2 a
4 2 c
7 2 d
2 3 x
I have a dataframe below
A B
0 a 1
1 a 2
2 c 3
3 c 4
4 e 5
I would like to get summing result below.key = column A
df.B.groupby(df.A).agg(np.sum)
But I want to add specific row.
B
a 3
b 0
c 7
d 0
e 5
f 0
but I should add row "b" and "d"."f"
How can I get this result ?
Use reindex
df.groupby('A').B.sum().reindex(list('abcdef'), fill_value=0)
A
a 3
b 0
c 7
d 0
e 5
f 0
Name: B, dtype: int64
I have an n by n data in csv in the following format
- A B C D
A 0 1 2 4
B 2 0 3 1
C 1 0 0 5
D 2 5 4 0
...
I would like to read it and convert to a 3D pandas dataframe in the following format:
Origin Dest Distance
A A 0
A B 1
A C 2
...
What is the best way to convert it? In the worst case, I'll write a for loop to read each line and append the transpose of it but there must be an easier way. Any help would be appreciated.
Use pd.melt()
Assuming, your dataframe looks like
In [479]: df
Out[479]:
- A B C D
0 A 0 1 2 4
1 B 2 0 3 1
2 C 1 0 0 5
3 D 2 5 4 0
In [480]: pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
.....: var_name='Dest', value_name='Distance')
Out[480]:
- Dest Distance
0 A A 0
1 B A 2
2 C A 1
3 D A 2
4 A B 1
5 B B 0
6 C B 0
7 D B 5
8 A C 2
9 B C 3
10 C C 0
11 D C 4
12 A D 4
13 B D 1
14 C D 5
15 D D 0
Where df.columns.values.tolist()[1:] are remaining columns ['A', 'B', 'C', 'D']
To replace '-' with 'Origin', you could use dataframe.rename(columns={...})
pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
var_name='Dest', value_name='Distance').rename(columns={'-': 'Origin'})