I have a dataframe like this:
ID day purchase
ID1 1 10
ID1 2 15
ID1 4 13
ID2 2 11
ID2 4 11
ID2 5 24
ID2 6 10
Desired output:
ID day purchase Txn
ID1 1 10 1
ID1 2 15 2
ID1 4 13 3
ID2 2 11 1
ID2 4 11 2
ID2 5 24 3
ID2 6 10 4
So for each ID, I want to create a counter to keep track of their transactions. In SAS, I would do something like: if first.ID then Txn = 1; else Txn + 1;
How can I do something like this in Python?
I've got the idea of sorting by ID and day, but how do I create a customized counter?
Here is one solution. Like you suggest, it involves sorting by ID and day (in case your original dataframe isn't already sorted), then grouping by ID and creating a counter for each ID:
# Make sure your dataframe is sorted properly (first by ID, then by day)
df = df.sort_values(['ID', 'day'])
# group by ID
by_id = df.groupby('ID')
# Make a custom counter using the default index of dataframes (adding 1)
df['txn'] = by_id.apply(lambda x: x.reset_index()).index.get_level_values(1)+1
>>> df
ID day purchase txn
0 ID1 1 10 1
1 ID1 2 15 2
2 ID1 4 13 3
3 ID2 2 11 1
4 ID2 4 11 2
5 ID2 5 24 3
6 ID2 6 10 4
If your dataframe started out as not properly sorted, you can get back to the original order like this:
df = df.sort_index()
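As a side note, if I'm not mistaken the same per-ID counter can be built more directly with groupby followed by cumcount; a minimal sketch, assuming the dataframe is already sorted by ID and day as above:
# cumcount numbers the rows within each ID group starting at 0, so add 1
df['txn'] = df.groupby('ID').cumcount() + 1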
The simplest method I could come up with, though definitely not the most efficient:
df['txn'] = [0]*len(df)
prev_ID = None
for index, row in df.iterrows():
    if row['ID'] == prev_ID:
        # same ID as the previous row: continue the running counter
        df.loc[index, 'txn'] = counter
        counter += 1
    else:
        # new ID: restart the counter at 1
        prev_ID = row['ID']
        df.loc[index, 'txn'] = 1
        counter = 2
Output:
ID day purchase txn
0 ID1 1 10 1
1 ID1 2 15 2
2 ID1 4 13 3
3 ID2 2 11 1
4 ID2 4 11 2
5 ID2 5 24 3
6 ID2 6 10 4
I have 2 pandas dataframes.
DF1:
rowid  city   id2  id3
1      citya  10   8
2      cityb  20   9
DF2:
city   id2  id3
cityc  10   8
cityd  10   4
citye  10   1
citye  20   9
cityf  20   4
citye  20   1
I want to concatenate the two dataframes based on the id2 values.
But I need to add the DF2 rows under the DF1 rows without duplicated values, like this (note: in DF1 I have many id2 values with different row numbers, e.g. id2: 10, id3: 2, and I need to filter by row values before inserting the DF2 values under the DF1 rows):
rowid  city   id2  id3
1      citya  10   8
       cityd  10   4
       citye  10   1
2      cityb  20   9
       cityf  20   4
       cityg  20   1
I don't have any idea how to do that.
You can use concat and drop_duplicates:
>>> (pd.concat([df1, df2.assign(rowid='')])
.drop_duplicates(['id2', 'id3'])
.sort_values('id2', ignore_index=True))
rowid city id2 id3
0 1 citya 10 8
1 cityd 10 4
2 citye 10 1
3 2 cityb 20 9
4 cityf 20 4
5 citye 20 1
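If it helps with reproducing, here is a minimal sketch that builds the two frames from the tables above (the values are read off the question, so treat them as assumptions) and applies the same chain:
import pandas as pd

df1 = pd.DataFrame({'rowid': [1, 2],
                    'city': ['citya', 'cityb'],
                    'id2': [10, 20],
                    'id3': [8, 9]})

df2 = pd.DataFrame({'city': ['cityc', 'cityd', 'citye', 'citye', 'cityf', 'citye'],
                    'id2': [10, 10, 10, 20, 20, 20],
                    'id3': [8, 4, 1, 9, 4, 1]})

out = (pd.concat([df1, df2.assign(rowid='')])   # give the DF2 rows an empty rowid
         .drop_duplicates(['id2', 'id3'])       # keep only the first (id2, id3) pair seen
         .sort_values('id2', ignore_index=True))
print(out)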
Greetings everyone,
I have this table (without the Res_Problem column):
ID   Problem  X     Impact  Prob  Res_Problem
ID1  12       IDC1  1       2     (12-2)=10
ID1  12       IDC2  2       2     (10-4)=6 STOP
ID1  12       IDC3  1       0     NO LOOP
ID1  12       IDC4  1       0     NO LOOP
ID2  10       IDB1  1       2     New Loop (10-2)=8
ID2  10       IDB1  1       2     (8-2)=6 STOP
I want a loop that subtracts Impact times Prob from Problem until it reaches a desired value (6 for example), stops the loop once it reaches 6, then starts the loop again on ID2... and so on. Any suggestions?
I think it has to be something like this :
while (df['Problem'] - df['Impact']*df['Impact'] < 6):
    df['loop'] = res
The loop should create the 'Res_Problem' column
Here is one option:
s = (df['Problem']
.sub(df['Impact'].mul(df['Prob'])
.groupby(df['ID']).cumsum()
)
)
m = s.le(6).groupby(df['ID']).shift(fill_value=False)
df['Res_Problem'] = s.mask(m)
output:
ID Problem X Impact Prob Res_Problem
0 ID1 12 IDC1 1 2 10.0
1 ID1 12 IDC2 2 2 6.0
2 ID1 12 IDC3 1 0 NaN
3 ID1 12 IDC4 1 0 NaN
4 ID2 10 IDB1 1 2 8.0
5 ID2 10 IDB1 1 2 6.0
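For reference, a minimal sketch that rebuilds the input frame from the table above (the values are assumptions read off the question) and applies the same idea with comments:
import pandas as pd

df = pd.DataFrame({'ID': ['ID1', 'ID1', 'ID1', 'ID1', 'ID2', 'ID2'],
                   'Problem': [12, 12, 12, 12, 10, 10],
                   'X': ['IDC1', 'IDC2', 'IDC3', 'IDC4', 'IDB1', 'IDB1'],
                   'Impact': [1, 2, 1, 1, 1, 1],
                   'Prob': [2, 2, 0, 0, 2, 2]})

# running remainder: Problem minus the cumulative Impact*Prob within each ID
s = df['Problem'].sub(df['Impact'].mul(df['Prob']).groupby(df['ID']).cumsum())

# rows that come after the group has already reached the threshold (<= 6) get NaN
m = s.le(6).groupby(df['ID']).shift(fill_value=False)
df['Res_Problem'] = s.mask(m)
print(df)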
This is a tricky one and I'm having a difficult time aggregating this data by week. So, starting on 5/26/20, for each week what is the total quantity? That is the desired dataframe. My data has 3 months' worth of data points where some products have 0 quantities, and this needs to be reflected in the desired df.
Original DF:
Product Date Qty
A 5/26/20 4
A 5/28/20 2
A 5/31/20 2
A 6/02/20 1
A 6/03/20 5
A 6/05/20 2
B 5/26/20 1
B 5/27/20 8
B 6/02/20 2
B 6/06/20 10
B 6/14/20 7
Desired DF
Product Week Qty
A 1 9
A 2 7
A 3 0
B 1 11
B 2 10
B 3 7
We can do it with transform, then create the week number by subtracting each product's first date:
s = (df.Date-df.groupby('Product').Date.transform('min')).dt.days//7 + 1
s = df.groupby([df.Product, s]).Qty.sum().unstack(fill_value=0).stack().reset_index()
s
Out[348]:
Product Date 0
0 A 1 8
1 A 2 8
2 A 3 0
3 B 1 9
4 B 2 12
5 B 3 7
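Note that this assumes Date is already a datetime column; a minimal sketch that converts it first and renames the output columns to match the desired frame (the final column names are my own choice):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')

# week number within each product, counted from that product's first date
week = (df['Date'] - df.groupby('Product')['Date'].transform('min')).dt.days // 7 + 1

out = (df.groupby([df['Product'], week.rename('Week')])['Qty']
         .sum()
         .unstack(fill_value=0)   # missing weeks become 0
         .stack()
         .reset_index(name='Qty'))
print(out)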
This is the table:
order_id product_id reordered department_id
2 33120 1 16
2 28985 1 4
2 9327 0 13
2 45918 1 13
3 17668 1 16
3 46667 1 4
3 17461 1 12
3 32665 1 3
4 46842 0 3
I want to group by department_id, summing the number of orders that come from that department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:
department_id number_of_orders number_of_reordered_0
3 2 1
4 2 0
12 1 0
13 2 1
16 2 0
I know this can be done in SQL (I forget what the query for that would look like as well, if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work?
I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.
Use GroupBy.agg with size and a lambda function that compares the values with Series.eq and counts the True values with sum (True is treated as 1):
df1 = (df.groupby('department_id')['reordered']
.agg([('number_of_orders','size'), ('number_of_reordered_0',lambda x: x.eq(0).sum())])
.reset_index())
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
If the values are only 1 and 0, it is possible to use sum and subtract at the end:
df1 = (df.groupby('department_id')['reordered']
.agg([('number_of_orders','size'), ('number_of_reordered_0','sum')])
.reset_index())
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
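If your pandas version supports named aggregation (0.25+), the same result can also be written without the tuple syntax; a sketch under that assumption:
df1 = (df.groupby('department_id', as_index=False)
         .agg(number_of_orders=('reordered', 'size'),
              number_of_reordered_0=('reordered', lambda x: x.eq(0).sum())))
print(df1)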
In SQL it would be a simple aggregation:
select department_id,count(*) as number_of_orders,
sum(case when reordered=0 then 1 else 0 end) as number_of_reordered_0
from tabl_name
group by department_id
I have a dataframe that looks like the following:
ID1 ID2 Date
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018
3 4 09/07/2018
etc.
What I need to do is to flag the first time that an ID in ID1 appears in ID2. In the above example this would look like
ID1 ID2 Date Flag
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018 Y
3 4 09/07/2018
I've used the following code to tell me if ID1 ever occurs in ID2:
ID2List = df['ID2'].tolist()
ID2List = list(set(ID2List))  # dedupe list
df['ID1 is in ID2List'] = np.where(df['ID1'].isin(ID2List), 'Yes', 'No')
But this only tells me that there is an occasion where ID1 appears in ID2 at some point but not the event at which this first occurs.
Any help?
One idea is to use next with a generator expression to calculate the indices of matches in ID1. Then compare with index and use argmax to get the index of the first True value:
idx = df.apply(lambda row: next((idx for idx, val in enumerate(df['ID1']) \
if row['ID2'] == val), 0), axis=1)
df.loc[(df.index > idx).argmax(), 'Flag'] = 'Y'
print(df)
ID1 ID2 Date Flag
0 1 2 01/01/2018 NaN
1 1 2 03/01/2018 NaN
2 1 2 04/05/2018 NaN
3 2 1 06/06/2018 Y
4 1 2 08/06/2018 NaN
5 3 4 09/07/2018 NaN
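To make the intermediate step easier to follow, here is a small sketch on the sample data (frame values taken from the question) showing what idx holds before the flag is set:
import pandas as pd

df = pd.DataFrame({'ID1': [1, 1, 1, 2, 1, 3],
                   'ID2': [2, 2, 2, 1, 2, 4],
                   'Date': ['01/01/2018', '03/01/2018', '04/05/2018',
                            '06/06/2018', '08/06/2018', '09/07/2018']})

# for each row: position of the first value in ID1 equal to that row's ID2
# (falls back to 0 when there is no match)
idx = df.apply(lambda row: next((i for i, val in enumerate(df['ID1'])
                                 if row['ID2'] == val), 0), axis=1)
print(idx.tolist())  # [3, 3, 3, 0, 3, 0]

# flag the first row whose position comes after that of its match in ID1
df.loc[(df.index > idx).argmax(), 'Flag'] = 'Y'
print(df)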