How to group a sequence based on a group column and assign a groupID - python

Below is the dataframe I have:
ColA ColB Time ColC
A B 01-01-2022 ABC
A B 02-01-2022 ABC
A B 07-01-2022 XYZ
A B 11-01-2022 IJK
A B 14-01-2022 ABC
Desired result:
ColA ColB Time ColC groupID
A B 01-01-2022 ABC 1
A B 02-01-2022 ABC 1
A B 07-01-2022 XYZ 2
A B 11-01-2022 IJK 3
A B 14-01-2022 ABC 4
UPDATED:
Below is the code I executed and the result after the cumsum:
df['ColC'] = df['ColC'].ne(df['ColC'].shift(1)).groupby([df['ColA'], df['ColB']]).cumsum()
ColA ColB Time ColC groupID
A B 01-01-2022 ABC 1
A B 02-01-2022 ABC 1
A B 07-01-2022 XYZ 2
A B 11-01-2022 XYZ 3
A B 14-01-2022 XYZ 4
Thank you in advance

The logic is not fully clear, but it looks like you're trying to group by week number (and ColC):
df['groupID'] = (df
    .groupby([pd.to_datetime(df['Time'], dayfirst=True).dt.isocalendar().week,
              'ColC'], sort=False)
    .ngroup().add(1)
)
output:
ColA ColB Time ColC groupID
0 A B 01-01-2022 ABC 1
1 A B 02-01-2022 ABC 1
2 A B 07-01-2022 XYZ 2
3 A B 11-01-2022 IJK 3
4 A B 14-01-2022 ABC 4
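If instead the goal is simply to open a new group every time ColC changes from the previous row (within each ColA/ColB pair), a shift-and-cumsum sketch along the lines of the attempt above also reproduces the desired output. The frame below is rebuilt from the question; note the shift comparison looks at the previous row regardless of group, which is fine for this single-group sample.
import pandas as pd

df = pd.DataFrame({
    'ColA': ['A'] * 5,
    'ColB': ['B'] * 5,
    'Time': ['01-01-2022', '02-01-2022', '07-01-2022', '11-01-2022', '14-01-2022'],
    'ColC': ['ABC', 'ABC', 'XYZ', 'IJK', 'ABC'],
})

# Flag rows where ColC differs from the previous row, then cumulatively sum
# those flags within each (ColA, ColB) pair to get a running group number.
df['groupID'] = (df['ColC'].ne(df['ColC'].shift())
                    .groupby([df['ColA'], df['ColB']])
                    .cumsum())
print(df)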

Related

Concatenate only new values of the first column of one dataframe to another one

I can't find a proper way to concatenate only the new values of colA. This is quite simple: I need the new elements of column colA to be added from DF2 to DF1.
DF1
colA colB colC
a 5 7
b 4 5
c 5 6
DF2
colA colE colF
a 7 e
b d 4
c f g
d h h
e 4 r
I have tried simple code like this, but the output dataframe is not correct:
DF3 = pd.concat([DF1, DF2['ColA']], keys=["ColA"])
DF3.drop_duplicates(subset=['ColA'], inplace=True, keep='last')
The result is that [a, 5, 7] is dropped and replaced by [a, nan, nan].
What I need is this :
DF3 (merged on colA):
colA colB colC
a 5 7
b 4 5
c 5 6
d
e
Then I fill DF3's missing values manually. I don't need colE or colF in DF3.
You can use pandas.DataFrame.merge:
>>> DF1.merge(DF2, how='outer', on='colA').reindex(DF1.columns, axis=1)
colA colB colC
0 a 5.0 7.0
1 b 4.0 5.0
2 c 5.0 6.0
3 d NaN NaN
4 e NaN NaN
Edit
To remove the NaN values and convert the other values back to int, you can try:
>>> DF1.merge(DF2['colA'], how='outer').fillna(-1, downcast='infer').replace({-1: ''})
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d
4 e
# If the -1 placeholder is a concern, convert to nullable "Int64" instead
>>> DF1.astype({'colB': 'Int64', 'colC': 'Int64'}).merge(DF2['colA'], how='outer')
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d <NA> <NA>
4 e <NA> <NA>
# You can replace the NaNs with an empty string as well (requires import numpy as np):
>>> DF1.astype({'colB': 'Int64', 'colC': 'Int64'}).merge(DF2['colA'], how='outer').replace({np.nan: ''})
colA colB colC
0 a 5 7
1 b 4 5
2 c 5 6
3 d
4 e
Remove keep='last' so the default keep='first' is used. Change:
DF3.drop_duplicates(subset=['ColA'], inplace=True, keep='last')
to:
DF3.drop_duplicates(subset=['ColA'], inplace=True)
Or just outer-merge with DF2[['colA']]:
DF1.merge(DF2[['colA']], how='outer')
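For reference, a minimal end-to-end sketch of the outer-merge approach with the sample frames from the question; the Int64 cast is an assumption to keep colB/colC as integers after the merge introduces missing values.
import pandas as pd

DF1 = pd.DataFrame({'colA': ['a', 'b', 'c'],
                    'colB': [5, 4, 5],
                    'colC': [7, 5, 6]})
DF2 = pd.DataFrame({'colA': ['a', 'b', 'c', 'd', 'e'],
                    'colE': [7, 'd', 'f', 'h', 4],
                    'colF': ['e', 4, 'g', 'h', 'r']})

# Outer merge on colA keeps every key from both frames; keys only in DF2
# ('d', 'e') get missing colB/colC, and colE/colF never enter the result
# because only DF2[['colA']] takes part in the merge.
DF3 = (DF1.astype({'colB': 'Int64', 'colC': 'Int64'})
          .merge(DF2[['colA']], how='outer', on='colA'))
print(DF3)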

Condense three dataframes to one, remove duplicates and nans [duplicate]

My question is closely related to Pandas Merge - How to avoid duplicating columns, but not identical to it.
I want to concatenate the columns that differ across three dataframes. The dataframes have an id column and some columns that are identical. For example:
df1
id place name qty unit A
1 NY Tom 2 10 a
2 TK Ron 3 15 a
3 Lon Don 5 90 a
4 Hk Sam 4 49 a
df2
id place name qty unit B
1 NY Tom 2 10 b
2 TK Ron 3 15 b
3 Lon Don 5 90 b
4 Hk Sam 4 49 b
df3
id place name qty unit C D
1 NY Tom 2 10 c d
2 TK Ron 3 15 c d
3 Lon Don 5 90 c d
4 Hk Sam 4 49 c d
Result:
id place name qty unit A B C D
1 NY Tom 2 10 a b c d
2 TK Ron 3 15 a b c d
3 Lon Don 5 90 a b c d
4 Hk Sam 4 49 a b c d
The columns place, name, qty, and unit will always be part of the three dataframes; the names of the columns that differ could vary (A, B, C, D in my example). The three dataframes have the same number of rows.
I have tried:
cols_to_use = df1.columns - df2.columns
dfNew = merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')
The problem is that I get more rows than expected, and the columns get renamed in the resulting dataframe (when using concat).
Using reduce from functools:
from functools import reduce
reduce(lambda left, right: pd.merge(left, right), [df1, df2, df3])
Out[725]:
id place name qty unit A B C D
0 1 NY Tom 2 10 a b c d
1 2 TK Ron 3 15 a b c d
2 3 Lon Don 5 90 a b c d
3 4 Hk Sam 4 49 a b c d
You can chain the merges:
merge_on = ['id', 'place', 'name', 'qty', 'unit']
df1.merge(df2, on=merge_on).merge(df3, on=merge_on)
id place name qty unit A B C D
0 1 NY Tom 2 10 a b c d
1 2 TK Ron 3 15 a b c d
2 3 Lon Don 5 90 a b c d
3 4 Hk Sam 4 49 a b c d
Using concat with groupby and first:
pd.concat([df1, df2, df3], axis=1).groupby(level=0, axis=1).first()
A B C D id name place qty unit
0 a b c d 1 Tom NY 2 10
1 a b c d 2 Ron TK 3 15
2 a b c d 3 Don Lon 5 90
3 a b c d 4 Sam Hk 4 49
You can extract only those columns from df2 (and df3 similarly) which are not already present in df1. Then just use pd.concat to concatenate the data frames:
cols = [c for c in df2.columns if c not in df1.columns]
df = pd.concat([df1, df2[cols]], axis=1)
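If there are more than three frames, the same "only new columns" idea can be looped. A minimal sketch, assuming all frames are row-aligned as stated in the question (the helper name collect_new_columns is illustrative):
import pandas as pd

base = {'id': [1, 2, 3, 4],
        'place': ['NY', 'TK', 'Lon', 'Hk'],
        'name': ['Tom', 'Ron', 'Don', 'Sam'],
        'qty': [2, 3, 5, 4],
        'unit': [10, 15, 90, 49]}
df1 = pd.DataFrame({**base, 'A': list('aaaa')})
df2 = pd.DataFrame({**base, 'B': list('bbbb')})
df3 = pd.DataFrame({**base, 'C': list('cccc'), 'D': list('dddd')})

def collect_new_columns(frames):
    # Append, from each later frame, only the columns not already present.
    out = frames[0]
    for df in frames[1:]:
        new_cols = [c for c in df.columns if c not in out.columns]
        out = pd.concat([out, df[new_cols]], axis=1)
    return out

print(collect_new_columns([df1, df2, df3]))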

Pandas deleting rows in order

Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply a condition: grouping by ID, if abc exists, then delete the row with xyz. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by ID and apply np.where(...). However, I don't think that approach would work in this case since it operates on rows.
Many thanks!
To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
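For reference, a self-contained reproduction of that mask with the sample frame rebuilt from the question (has_abc is just an illustrative name):
import pandas as pd

df = pd.DataFrame({'ID':   [1, 1, 2, 2, 3, 3, 3, 4],
                   'Text': ['abc', 'xyz', 'xyz', 'abc', 'xyz', 'abc', 'ijk', 'xyz']})

# For each row, check whether its ID group contains 'abc' anywhere, then drop
# the row only if it is an 'xyz' row inside such a group.
has_abc = df['Text'].eq('abc').groupby(df['ID']).transform('any')
print(df[~(has_abc & df['Text'].eq('xyz'))])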
I am using crosstab:
s = pd.crosstab(df.ID, df.Text)
s.xyz = s.xyz.mask(s.abc.eq(1) & s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
s.replace(0, np.nan).stack().reset_index().drop(0, axis=1)
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz

Add values from columns into a new column using pandas

I have a dataframe:
id category value
1 1 abc
2 2 abc
3 1 abc
4 4 abc
5 4 abc
6 3 abc
Category 1 = best, 2 = good, 3 = bad, 4 = ugly.
I want to create a new column such that, for category 1, the value in the column should be cat_1, and for category 2 the value should be cat_2.
In new_col2, the value for category 1 should be cat_best, and for category 2 it should be cat_good.
df['new_col'] = ''
my final df
id category value new_col new_col2
1 1 abc cat_1 cat_best
2 2 abc cat_2 cat_good
3 1 abc cat_1 cat_best
4 4 abc cat_4 cat_ugly
5 4 abc cat_4 cat_ugly
6 3 abc cat_3 cat_bad
I can iterate over it in a for loop:
for index, row in df.iterrows():
    df.loc[df.id == row.id, 'new_col'] = 'cat_' + str(row['category'])
Is there a better (less time-consuming) way of doing it?
I think you need to join the string with the column converted to string, and for the second column use map and then join:
d = {1:'best', 2: 'good', 3 : 'bad', 4 :'ugly'}
df['new_col'] = 'cat_'+ df['category'].astype(str)
df['new_col2'] = 'cat_'+ df['category'].map(d)
Or:
df = df.assign(new_col='cat_' + df['category'].astype(str),
               new_col2='cat_' + df['category'].map(d))
print (df)
id category value new_col new_col2
0 1 1 abc cat_1 cat_best
1 2 2 abc cat_2 cat_good
2 3 1 abc cat_1 cat_best
3 4 4 abc cat_4 cat_ugly
4 5 4 abc cat_4 cat_ugly
5 6 3 abc cat_3 cat_bad
You can also do it using apply:
df['new_col'] = df['category'].apply(lambda x: "cat_" + str(x))
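A self-contained version of the vectorized approach, with the sample frame rebuilt from the question:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'category': [1, 2, 1, 4, 4, 3],
                   'value': ['abc'] * 6})

d = {1: 'best', 2: 'good', 3: 'bad', 4: 'ugly'}
# String concatenation and map are vectorized, so no row-by-row loop is needed.
df['new_col'] = 'cat_' + df['category'].astype(str)
df['new_col2'] = 'cat_' + df['category'].map(d)
print(df)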

How do I exclude NaN/NaT/None from a python groupby count, but include the row?

>> df
Foo Bar Number Date
0 abc None NaN NaT
1 abcdefg None NaN NaT
2 abcd this 1111222 3/8/2017
3 abcd that 1233336 3/3/2017
4 abcd what 1346554 3/3/2017
5 abcde that 8889995 3/9/2017
6 abcde this 1849552 3/8/2017
7 abcd that 7418652 3/3/2017
8 abcdef this 4865154 3/7/2017
>> df.groupby(['Foo']).size().reset_index(name='Total')
If I do it this way, the row is counted as having one value, which it does, and I understand that. I'm not sure how to include the row in the output (its Foo group should still appear) without actually counting the None/NaN/NaT values.
Returns:
Foo Total
0 abc 1
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 1
Expected result:
Foo Total
0 abc 0
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 0
You could drop nulls first, then reindex with the unique values of the Foo column at the end with a fill value.
(df.dropna().groupby('Foo')
.size()
.reindex(df.Foo.unique(), fill_value=0)
.reset_index(name='total'))
Or alternatively, you could make your Foo column Categorical.
df.Foo = pd.Categorical(df.Foo)
df.dropna().groupby('Foo').size().reset_index(name='total')
Demo
>>> (df.dropna().groupby('Foo')
.size()
.reindex(df.Foo.unique(), fill_value=0)
.reset_index(name='total'))
Foo total
0 abc 0
1 abcdefg 0
2 abcd 4
3 abcde 2
4 abcdef 1
############################################################################
>>> df.Foo = pd.Categorical(df.Foo)
>>> df.dropna().groupby('Foo').size().reset_index(name='total')
Foo total
0 abc 0
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 0
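A further alternative sketch, assuming that counting a single nullable column (Bar here) is an acceptable stand-in for "the row has real data"; groups whose Bar is entirely missing still appear, with a Total of 0.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Foo':    ['abc', 'abcdefg', 'abcd', 'abcd', 'abcd', 'abcde', 'abcde', 'abcd', 'abcdef'],
    'Bar':    [None, None, 'this', 'that', 'what', 'that', 'this', 'that', 'this'],
    'Number': [np.nan, np.nan, 1111222, 1233336, 1346554, 8889995, 1849552, 7418652, 4865154],
    'Date':   pd.to_datetime([None, None, '3/8/2017', '3/3/2017', '3/3/2017',
                              '3/9/2017', '3/8/2017', '3/3/2017', '3/7/2017']),
})

# count() ignores NaN/None/NaT, but every Foo value still forms a group,
# so 'abc' and 'abcdefg' show up with Total 0.
print(df.groupby('Foo')['Bar'].count().reset_index(name='Total'))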
