pandas: group consecutive rows with the same values into one group - python

Assume that I have a pandas dataframe of purchases, with no invoice ID, like this:
item_id customer_id
1 A
2 A
1 B
3 C
4 C
1 A
5 A
My assumption is that if multiple items are bought by a customer in consecutive rows, they belong to one order. So I would like to create an order_id column like this:
item_id customer_id order_id
1 A 1
2 A 1
1 B 2
3 C 3
4 C 3
1 A 4
5 A 4
The order_id should be created automatically and be incremental. How should I do that with pandas?
Many thanks

IIUC, here's one way:
df['order_id'] = df.customer_id.ne(df.customer_id.shift()).cumsum()
OUTPUT:
item_id customer_id order_id
0 1 A 1
1 2 A 1
2 1 B 2
3 3 C 3
4 4 C 3
5 1 A 4
6 5 A 4
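To see why this works: shift() moves customer_id down one row, ne() flags every row where the customer differs from the previous row, and cumsum() turns those flags into a running group counter. The same idiom unpacked step by step, as a sketch on the question's data:

import pandas as pd

df = pd.DataFrame({'item_id': [1, 2, 1, 3, 4, 1, 5],
                   'customer_id': ['A', 'A', 'B', 'C', 'C', 'A', 'A']})

prev = df['customer_id'].shift()       # previous row's customer (NaN on the first row)
starts = df['customer_id'].ne(prev)    # True wherever a new run of a customer begins
df['order_id'] = starts.cumsum()       # running count of run starts = the order id
print(df)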

Related

Search and update values in other dataframe for specific columns

I have two different dataframes in pandas.
First
A B C D VALUE
1 2 3 5 0
1 5 3 2 0
2 5 3 2 0
Second
A B C D Value
5 3 3 2 1
1 5 4 3 1
I want the A and B values of the first dataframe to be looked up in the second dataframe. If the A and B values match, the VALUE column should be updated. That is, search on two columns of the other dataframe and update only one column, the kind of operation we know from SQL.
Result
A B C D VALUE
1 2 3 5 0
1 5 3 2 1
2 5 3 2 0
The value that changes is the 1 in the second row of the Result table. Despite my attempts, I could not succeed: I only want the VALUE column to change, but my attempts also change A and B. I only want the VALUE column of the matching rows to change.
You can use a merge:
cols = ['A', 'B']
df1['VALUE'] = (df2.merge(df1[cols], on=cols, how='right')['Value']
                  .fillna(df1['VALUE'], downcast='infer'))
output:
A B C D VALUE
0 1 2 3 5 0
1 1 5 3 2 1
2 2 5 3 2 0
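If you prefer to avoid the merge, a lookup keyed on (A, B) works as well. A minimal sketch, assuming the (A, B) pairs are unique in df2; the frames are rebuilt here from the question's tables:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 5, 5],
                    'C': [3, 3, 3], 'D': [5, 2, 2],
                    'VALUE': [0, 0, 0]})
df2 = pd.DataFrame({'A': [5, 1], 'B': [3, 5],
                    'C': [3, 4], 'D': [2, 3],
                    'Value': [1, 1]})

# Index df2 by the key columns, look up every (A, B) pair of df1,
# and keep the old VALUE wherever there is no match.
lookup = df2.set_index(['A', 'B'])['Value']
keys = pd.MultiIndex.from_frame(df1[['A', 'B']])
df1['VALUE'] = (pd.Series(lookup.reindex(keys).to_numpy(), index=df1.index)
                  .fillna(df1['VALUE'])
                  .astype(int))
print(df1)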

Create subject-wise timepoints for date column in dataframe

I have a data frame that contains a column for the subject id and a column containing information about the date. I want to create a third variable that indicates the time order of the dates for each subject. An example:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3, 3],
                   'Date': [20191219, 20191220, 20191220,
                            20191219, 20191219, 20191220, 20191221]})
which gives you:
ID Date
0 1 20191219
1 1 20191220
2 2 20191220
3 2 20191219
4 3 20191219
5 3 20191220
6 3 20191221
Add a third variable t so that you get:
ID Date t
0 1 20191219 0
1 1 20191220 1
2 2 20191220 1
3 2 20191219 0
4 3 20191219 0
5 3 20191220 1
6 3 20191221 2
After clarification, I think you need the groupby.cumcount() method, but before that you need to sort the values by ID and Date and drop any duplicates:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3, 3, 4],
                   'Foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                   'Date': [20191219, 20191219, 20191220, 20191219,
                            20191219, 20191220, 20191221, 20191222]})

# Number the unique (ID, Date) rows per ID; rows dropped as duplicates
# get NaN on assignment and are filled from the previous row, which
# holds the same (ID, Date).
df['t'] = (df.sort_values(['ID', 'Date'])
             .drop_duplicates(['ID', 'Date'])
             .groupby('ID')
             .cumcount())
df['t'] = df['t'].ffill().astype(int)
print(df)
Prints:
ID Foo Date t
0 1 a 20191219 0
1 1 b 20191219 0
2 2 c 20191220 1
3 2 d 20191219 0
4 3 e 20191219 0
5 3 f 20191220 1
6 3 g 20191221 2
7 4 h 20191222 0
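An alternative worth sketching: a dense rank of Date within each ID yields the same numbering directly, without the drop_duplicates/ffill round trip. Reusing the df defined above:

# method='dense' gives equal dates the same rank with no gaps;
# subtracting 1 makes the numbering start at 0.
df['t'] = df.groupby('ID')['Date'].rank(method='dense').astype(int) - 1
print(df)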

Replace string values in pandas with their count

I'm trying to calculate the count of some values in a data frame, like
user_id event_type
1 a
1 a
1 b
2 a
2 b
2 c
2 c
and I want to get a table like
user_id event_type event_type_count
1 a 2
1 a 2
1 b 1
2 a 1
2 b 1
2 c 2
2 c 2
In other words, I want to put the count of each value alongside the value in the data frame.
I've tried df.join(pd.crosstab(...)), but I get a large data frame with many columns.
What is a better way to solve this problem?
Use GroupBy.transform by both columns with GroupBy.size:
df['event_type_count'] = df.groupby(['user_id','event_type'])['event_type'].transform('size')
print (df)
user_id event_type event_type_count
0 1 a 2
1 1 a 2
2 1 b 1
3 2 a 1
4 2 b 1
5 2 c 2
6 2 c 2
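For reference, a self-contained sketch of the same approach; the key point is that transform broadcasts each group's size back onto every row of that group, so the result aligns with the original frame:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2, 2],
                   'event_type': ['a', 'a', 'b', 'a', 'b', 'c', 'c']})

# Every row receives the size of its (user_id, event_type) group.
df['event_type_count'] = (df.groupby(['user_id', 'event_type'])['event_type']
                            .transform('size'))
print(df)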

Count the number of times value A occurs with value B

I'm trying to count the number of times a value in a Pandas dataframe occurs together with another value, and record that count on each row.
This is what I mean:
a t
0 a 2
1 b 4
2 c 2
3 g 2
4 b 3
5 a 2
6 b 3
Say I want to count the number of times a occurs together with the number 2; I'd like the result to be:
a t freq
0 a 2 2
1 b 4 1
2 c 2 1
3 g 2 1
4 b 3 2
5 a 2 2
6 b 3 2
The freq (frequency) column here indicates the number of times a value in column a appears together with a value in column t.
Please note that a solution that, for example, only counts the number of times a occurs would give a wrong frequency, given the size of my dataframe.
Is there a way to achieve this in Python?
Use transform with size or count:
df['freq'] = df.groupby(['a', 't'])['a'].transform('size')
#alternative solution
#df['freq'] = df.groupby(['a', 't'])['a'].transform('count')
print (df)
a t freq
0 a 2 2
1 b 4 1
2 c 2 1
3 g 2 1
4 b 3 2
5 a 2 2
6 b 3 2
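One caveat on the two alternatives: 'size' counts rows including NaN, while 'count' skips NaN, so they only differ when the counted column has missing values. A minimal sketch:

import pandas as pd
import numpy as np

s = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1.0, np.nan, 2.0]})
print(s.groupby('g')['v'].transform('size'))   # 2, 2, 1 (NaN row is counted)
print(s.groupby('g')['v'].transform('count'))  # 1, 1, 1 (NaN row is skipped)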

Converting from group by output to separate columns in pandas

I have a grouped-by pandas dataframe that looks like this:
id type count
1 A 4
1 B 5
2 A 3
3 C 0
3 B 6
and I am hoping to get this output:
id A B C
1 4 5 0
2 3 0 0
3 0 6 0
I feel like there is a straightforward solution to this that I am not seeing.
Use pivot:
df.pivot(index='id', columns='type', values='count').fillna(0)
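A self-contained sketch with the question's data; pivot leaves NaN for missing (id, type) pairs, so fill with 0 and cast back to int (recent pandas requires keyword arguments for pivot):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 3],
                   'type': ['A', 'B', 'A', 'C', 'B'],
                   'count': [4, 5, 3, 0, 6]})

out = (df.pivot(index='id', columns='type', values='count')
         .fillna(0)
         .astype(int)
         .reset_index()
         .rename_axis(columns=None))
print(out)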
