I have the following example df:
col1 col2 col3 doc_no
0 a x f 0
1 a x f 1
2 b x g 2
3 b y g 3
4 c x t 3
5 c y t 4
6 a x f 5
7 d x t 5
8 d x t 6
I want to group by the first three columns (col1, col2, col3), concatenate the fourth column (doc_no) into a comma-separated string per group, and also generate a count column for each grouping, sorted by count in descending order (count). Example desired output below (column order doesn't matter):
col1 col2 col3 count doc_no
0 a x f 3 0, 1, 5
1 d x t 2 5, 6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
How would I go about doing this? I used the below line to get just the grouping and the count:
grouped_df = df.groupby(['col1','col2','col3']).size().reset_index(name='count')\
.sort_values(['count'], ascending=False).reset_index()
But I'm not sure how to also get the concatenated doc_no column in the same code line.
Try groupby and agg like so:
(df.groupby(['col1', 'col2', 'col3'])['doc_no']
 .agg(count='count', doc_no=lambda x: ','.join(map(str, x)))
.sort_values('count', ascending=False)
.reset_index())
col1 col2 col3 count doc_no
0 a x f 3 0,1,5
1 d x t 2 5,6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
agg is convenient here because you can run several reducers on a single column in one call.
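For instance, here is a minimal, self-contained sketch (recreating the sample df from the question; the first_doc reducer is purely illustrative) that runs several reducers on doc_no in one agg call:

import pandas as pd

df = pd.DataFrame({
    'col1': list('aabbccadd'),
    'col2': list('xxxyxyxxx'),
    'col3': list('ffggttftt'),
    'doc_no': [0, 1, 2, 3, 3, 4, 5, 5, 6],
})

out = (df.groupby(['col1', 'col2', 'col3'])['doc_no']
         .agg(count='count',                               # group size
              first_doc='min',                             # smallest doc_no in the group
              doc_no=lambda x: ','.join(map(str, x)))      # concatenated ids
         .sort_values('count', ascending=False)
         .reset_index())
print(out)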
Let us do
df.doc_no=df.doc_no.astype(str)
s=df.groupby(['col1','col2','col3']).doc_no.agg(['count',','.join]).reset_index()
s
col1 col2 col3 count join
0 a x f 3 0,1,5
1 b x g 1 2
2 b y g 1 3
3 c x t 1 3
4 c y t 1 4
5 d x t 2 5,6
Another way:
df2 = df.groupby(['col1','col2','col3'])['doc_no'].agg(list).reset_index()
df2['doc_no'] = df2['doc_no'].astype(str).str[1:-1]
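If you also need the count column with this variant, one possible follow-up (a sketch; the count is taken from each list's length before it is flattened to a string):

df2 = df.groupby(['col1', 'col2', 'col3'])['doc_no'].agg(list).reset_index()
df2['count'] = df2['doc_no'].str.len()                # number of ids per group
df2['doc_no'] = df2['doc_no'].astype(str).str[1:-1]   # "[0, 1, 5]" -> "0, 1, 5"
df2 = df2.sort_values('count', ascending=False).reset_index(drop=True)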
The sample dataset looks like this:
col1 col2 col3
A    1    as
A    2    sd
B    3    df
C    5    fg
D    6    gh
A    1    hj
B    3    jk
B    4    kt
A    1    re
C    5    we
D    6    qw
D    7    aa
I want to sort the column col1 based on the number of occurrences each item has, e.g. A has 4 occurrences, B and D have 3, and C has 2 occurrences. The dataframe should be sorted like A,A,A,A,B,B,B,D,D,D,C,C, so that the most frequent values come first.
Is there a way to achieve this? Can I use sort_values to get the desired result?
Create a helper column with Series.map plus Series.value_counts and use it for sorting together with col1 in DataFrame.sort_values:
df['new'] = df['col1'].map(df['col1'].value_counts())
#alternative
#df['new'] = df.groupby('col1')['col1'].transform('count')
df1 = df.sort_values(['new','col1'], ascending=[False, True]).drop('new', axis=1)
One line solution:
df1 = (df.assign(new=df['col1'].map(df['col1'].value_counts()))
         .sort_values(['new','col1'], ascending=[False, True])
         .drop('new', axis=1))
print (df1)
col1 col2 col3
0 A 1 as
1 A 2 sd
5 A 1 hj
8 A 1 re
2 B 3 df
6 B 3 jk
7 B 4 kt
4 D 6 gh
10 D 6 qw
11 D 7 aa
3 C 5 fg
9 C 5 we
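To see what the helper column contains, a quick inspection sketch (counts computed from the sample data above):

counts = df['col1'].value_counts()   # A: 4, B: 3, D: 3, C: 2

# Each row is mapped to the frequency of its own col1 value:
print(df['col1'].map(counts).tolist())
# [4, 4, 3, 2, 3, 4, 3, 3, 4, 2, 3, 3]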
You can use sort_values, but you have to pass a callable as the key argument. From the documentation:
Apply the key function to the values before sorting. This is similar
to the key argument in the builtin sorted() function, with the notable
difference that this key function should be vectorized. It should
expect a Series and return a Series with the same shape as the input.
It will be applied to each column in by independently.
In your case, the key function has to count the number of times each value appears in col1.
df.sort_values(by='col1', key=lambda x: [((df.col1 == y).sum(), -ord(y)) for y in x], ascending=False)
The tuple ((df.col1 == y).sum(), -ord(y)) breaks ties between letters that have the same number of occurrences: the negated Unicode code point, combined with the descending sort, puts them in alphabetical order.
If your dataframe is large, you should precompute these values using value_counts and map:
df.sort_values(by='col1', key=lambda x: df.col1.map({k: (v, -ord(k)) for k,v in df.col1.value_counts().to_dict().items()}), ascending=False)
Here the result:
col1 col2 col3
0 A 1 as
1 A 2 sd
5 A 1 hj
8 A 1 re
2 B 3 df
6 B 3 jk
7 B 4 kt
4 D 6 gh
10 D 6 qw
11 D 7 aa
3 C 5 fg
9 C 5 we
I have consecutive row duplicates in two columns.
I want to delete the second row duplicate based on [col1,col2] and move the value of another column to a new one.
Example:
Input
col1 col2 col3
X A 1
X A 2
Y A 3
Y A 4
X B 5
X B 6
Z C 7
Z C 8
Output
col1 col2 col3 col4
X A 1 2
Y A 3 4
X B 5 6
Z C 7 8
I found out about pivoting, but I am struggling to understand how to add another column and avoid the extra index; I would like to preserve everything as written in the example.
This is similar to Question 10 here:
(df.assign(col=df.groupby(['col1','col2']).cumcount())
.pivot_table(index=['col1','col2'], columns='col', values='col3')
.reset_index()
)
Output:
col col1 col2 0 1
0 X A 1 2
1 X B 5 6
2 Y A 3 4
3 Z C 7 8
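To match the column names in the desired output exactly (col3 and col4, with a flat column axis), one possible follow-up on the pivoted frame — a sketch that assumes exactly two rows per (col1, col2) pair:

out = (df.assign(col=df.groupby(['col1', 'col2']).cumcount())
         .pivot_table(index=['col1', 'col2'], columns='col', values='col3')
         .reset_index())
out.columns.name = None                      # drop the 'col' axis label
out = out.rename(columns={0: 'col3', 1: 'col4'})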
I have two dataframes:
Dataframe A:
Col1 Col2 Value
A X 1
A Y 2
B X 3
B Y 2
C X 5
C Y 4
Dataframe B:
Col1
A
B
C
What I need is to add to Dataframe B one column for each distinct value in Col2 of Dataframe A (in this case, X and Y), filling them with the values from the "Value" column after merging the two dataframes on Col1. Here it is:
Col1 X Y
A 1 2
B 3 2
C 5 4
Thank you very much for your help!
B['X'] = A.loc[A['Col2'] == 'X', 'Value'].reset_index(drop = True)
B['Y'] = A.loc[A['Col2'] == 'Y', 'Value'].reset_index(drop = True)
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
If you are going to have hundreds of distinct values in Col2, you can run the same line in a loop:
for t in A['Col2'].unique():
    B[t] = A.loc[A['Col2'] == t, 'Value'].reset_index(drop = True)
B
B
You get the same output:
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
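A pivot-based alternative (not from the answer above, just one more way to get the same shape) — a sketch assuming each (Col1, Col2) pair occurs once in A:

# Reshape A so each distinct Col2 value becomes its own column,
# then attach the result to B on Col1.
wide = A.pivot(index='Col1', columns='Col2', values='Value').reset_index()
wide.columns.name = None        # drop the 'Col2' axis label
result = B.merge(wide, on='Col1', how='left')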
I have a df like below
a = pd.DataFrame([{'col1': ['a', 'b', 'c'], 'col2': 'x'}, {'col1': ['d', 'b'], 'col2': 'y'}])
When I do an explode using a.explode('col1'), I get the results below:
col1 col2
a x
b x
c x
d y
b y
However, I wanted something like the below:
col1 col2 col1_index
a x 1
b x 2
c x 3
d y 1
b y 2
Can someone help me?
You could do the following:
result = a.explode('col1').reset_index().rename(columns={'index' : 'col1_index'})
result['col1_index'] = result.groupby('col1_index').cumcount() + 1
print(result)
Output
col1_index col1 col2
0 1 a x
1 2 b x
2 3 c x
3 1 d y
4 2 b y
After you explode, you can simply do:
a['col1_index'] = a.groupby('col2').cumcount()+1
col1 col2 col1_index
0 a x 1
1 b x 2
2 c x 3
3 d y 1
4 b y 2
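Note that grouping by col2 only works here because the two original rows happen to have distinct col2 values. A more robust sketch numbers elements within each original row by grouping on the pre-reset index, which explode repeats (0, 0, 0, 1, 1):

exploded = a.explode('col1')
exploded['col1_index'] = exploded.groupby(level=0).cumcount() + 1
exploded = exploded.reset_index(drop=True)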
I have a data frame like this:
df
col1 col2 col3
1 A B
1 D R
2 R P
2 D F
3 T G
1 R S
3 R S
I want to get the data frame with the first 3 unique values of col1. If a col1 value reappears later in the df, those later rows are ignored.
The final data frame should look like:
df
col1 col2 col3
1 A B
1 D R
2 R P
2 D F
3 T G
What is the most efficient way to do this in pandas?
Create a helper series of consecutive group ids with Series.ne, Series.shift and Series.cumsum, then filter by boolean indexing:
N = 3
df = df[df.col1.ne(df.col1.shift()).cumsum() <= N]
print (df)
col1 col2 col3
0 1 A B
1 1 D R
2 2 R P
3 2 D F
4 3 T G
Detail — row 5 (col1=1) starts consecutive group 4 and row 6 (col1=3) starts group 5, so both fall outside the first N=3 groups:
print (df.col1.ne(df.col1.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 4
6 5
Name: col1, dtype: int32
Here is a solution which stops as soon as the first three different values have been found:
import io
import pandas as pd
data="""
col1 col2 col3
1 A B
1 D R
2 R P
2 D F
3 T G
1 R S
3 R S
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
nbr = 3
dico = {}
for index, row in df.iterrows():
    dico[row.col1] = True
    if len(dico) == nbr:
        df = df[0:index+1]
        break
print(df)
col1 col2 col3
0 1 A B
1 1 D R
2 2 R P
3 2 D F
4 3 T G
You can use the duplicated method in pandas:
mask1 = df.duplicated(keep="first")  # True for every duplicate row except its first occurrence
mask2 = df.duplicated(keep=False)    # True for every row that has any duplicate
mask = ~mask1 | ~mask2               # keep first occurrences and rows that appear only once
df[mask]