aggregate under certain condition - python

I have this data frame.
df = pd.DataFrame({'day': [1, 2, 1, 4, 2, 3],
                   'user': ['A', 'B', 'B', 'B', 'A', 'A'],
                   'num_posts': [1, 2, 3, 4, 5, 6]})
I want a new column containing each user's total number of posts to date as of each post, excluding that day itself. What I want looks like this:
user  day  num_posts  total_todate
A     1    1          0
B     2    2          3
B     1    3          0
B     4    4          5
A     2    5          1
A     3    6          6
Any ideas?

You can sort the data frame by day, group by user, compute the cumulative sum of the num_posts column, and then shift it down by one so each row only counts earlier days:
df['total_todate'] = (df.sort_values('day')
                        .groupby('user').num_posts
                        .transform(lambda p: p.cumsum().shift())
                        .fillna(0))
df
#    day  num_posts user  total_todate
# 0    1          1    A           0.0
# 1    2          2    B           3.0
# 2    1          3    B           0.0
# 3    4          4    B           5.0
# 4    2          5    A           1.0
# 5    3          6    A           6.0
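For reference, a self-contained version of the above, with the fillna result cast back to int so the new column matches the num_posts dtype:

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 4, 2, 3],
                   'user': ['A', 'B', 'B', 'B', 'A', 'A'],
                   'num_posts': [1, 2, 3, 4, 5, 6]})

# Sort by day so the cumulative sum runs in chronological order, then
# shift within each user so a row only counts strictly earlier days.
df['total_todate'] = (df.sort_values('day')
                        .groupby('user')['num_posts']
                        .transform(lambda p: p.cumsum().shift())
                        .fillna(0)
                        .astype(int))
print(df)
```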


Transforming Dataframe from days to weeks and aggregating quantity column

This is a tricky one and I'm having a difficult time aggregating this data by week. Starting on 5/26/20, what is the total quantity for each week? That is the desired dataframe. My data has three months' worth of data points, and some products have weeks with zero quantity; those weeks need to be reflected in the desired df.
Original DF:
Product  Date     Qty
A        5/26/20  4
A        5/28/20  2
A        5/31/20  2
A        6/02/20  1
A        6/03/20  5
A        6/05/20  2
B        5/26/20  1
B        5/27/20  8
B        6/02/20  2
B        6/06/20  10
B        6/14/20  7
Desired DF:
Product  Week  Qty
A        1     9
A        2     7
A        3     0
B        1     11
B        2     10
B        3     7
We can do it with transform, then create the new week number by subtracting each product's first date:
# Date must be a datetime column; the week number counts from each product's first date
s = (df.Date - df.groupby('Product').Date.transform('min')).dt.days // 7 + 1
s = df.groupby([df.Product, s]).Qty.sum().unstack(fill_value=0).stack().reset_index()
s
Out[348]:
  Product  Date   0
0       A     1   8
1       A     2   8
2       A     3   0
3       B     1   9
4       B     2  12
5       B     3   7
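A runnable sketch of the same idea (column names are taken from the question; the Date column must be parsed to datetime first). The unstack/stack round-trip is what materialises weeks with zero quantity:

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'Date': ['5/26/20', '5/28/20', '5/31/20', '6/02/20', '6/03/20', '6/05/20',
             '5/26/20', '5/27/20', '6/02/20', '6/06/20', '6/14/20'],
    'Qty': [4, 2, 2, 1, 5, 2, 1, 8, 2, 10, 7],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')

# Week number relative to each product's first date
week = ((df.Date - df.groupby('Product').Date.transform('min')).dt.days // 7 + 1).rename('Week')

# unstack with fill_value=0 inserts missing weeks, stack flattens back
out = (df.groupby([df.Product, week]).Qty.sum()
         .unstack(fill_value=0).stack().reset_index(name='Qty'))
print(out)
```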

Pandas column difference over time

I have a data frame with inventory data that looks like the following:
d = {'product': ['a', 'b', 'a', 'b', 'c'],
     'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
df
  product  amount      date
0       a       1  2020-6-6
1       b       2  2020-6-6
2       a       3  2020-6-7
3       b       5  2020-6-7
4       c       2  2020-6-7
I would like to know what the inventory difference is month to month. The output would look like this:
df
  product  diff  isnew      date
0       a   nan    nan  2020-6-6
1       b   nan    nan  2020-6-6
2       a     2  False  2020-6-7
3       b     3  False  2020-6-7
4       c     2   True  2020-6-7
Sorry if I was not clear in the first example: in reality I have many months of data, so I am not just looking at the difference of one period versus another. It needs to be the general case that compares month n with n-1, then n-1 with n-2, and so on.
What's the best way to do this in Pandas?
I guess the key here is to find the is_new flag:
# rows after the first date (the first month has nothing to compare against)
later = df['date'] != df['date'].min()
# a product is new on its first appearance
df.loc[later, 'is_new'] = ~df.duplicated('product')
# then the difference:
df['diff'] = (df.groupby('product')['amount'].diff()  # normal differences
                .fillna(df['amount'])                 # fill the first value per product
                .where(df['is_new'].notna()))         # blank out the first month
Output:
  product  amount      date is_new  diff
0       a       1  2020-6-6    NaN   NaN
1       b       2  2020-6-6    NaN   NaN
2       a       3  2020-6-7  False   2.0
3       b       5  2020-6-7  False   3.0
4       c       2  2020-6-7   True   2.0
You can try groupby on the product column and diff the amount column for 'diff', then use duplicated for 'isnew':
df['diff'] = df.groupby('product')['amount'].diff()
df['isnew'] = ~df['product'].duplicated()
print (df)
  product  amount      date  diff  isnew
0       a       1  2020-6-6   NaN   True
1       b       2  2020-6-6   NaN   True
2       a       3  2020-6-7   2.0  False
3       b       5  2020-6-7   3.0  False
4       c       2  2020-6-7   NaN   True
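Putting the first approach together on the sample data (a sketch; the dates are kept as strings, which happen to compare correctly here, but real data should go through pd.to_datetime first):

```python
import pandas as pd

df = pd.DataFrame({'product': ['a', 'b', 'a', 'b', 'c'],
                   'amount': [1, 2, 3, 5, 2],
                   'date': ['2020-6-6', '2020-6-6', '2020-6-7',
                            '2020-6-7', '2020-6-7']})

# rows after the first period (the first period has nothing to compare against)
later = df['date'] != df['date'].min()
# a product is new on its first appearance
df.loc[later, 'is_new'] = ~df.duplicated('product')

df['diff'] = (df.groupby('product')['amount'].diff()  # period-over-period change
                .fillna(df['amount'])                 # first value per product
                .where(df['is_new'].notna()))         # blank out the first period
print(df)
```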

pandas number of items in one column per value in another column

I have two dataframes. Say, for example, frame 1 is the student info:
student_id  course
1           a
2           b
3           c
4           a
5           f
6           f
Frame 2 is each interaction the student has with a program:
student_id  day  number_of_clicks
1           4    60
1           5    34
1           7    87
2           3    33
2           4    29
2           8    213
2           9    46
3           2    103
I am trying to add the information from frame 2 to frame 1, i.e. for each student I would like to know the number of different days they accessed the database and the sum of all their clicks on those days, e.g.:
student_id  course  no_days  total_clicks
1           a       3        181
2           b       4        321
3           c       1        103
4           a       0        0
5           f       0        0
6           f       0        0
I've tried to do this with groupby, but I couldn't add the information back into frame 1 or figure out how to sum the number of clicks. Any ideas?
First we aggregate your df2 to the desired information using GroupBy.agg. Then we merge that information into df1:
agg = df2.groupby('student_id').agg(
    no_days=('day', 'size'),
    total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
   student_id course  no_days  total_clicks
0           1      a      3.0         181.0
1           2      b      4.0         321.0
2           3      c      1.0         103.0
3           4      a      0.0           0.0
4           5      f      0.0           0.0
5           6      f      0.0           0.0
Or if you like one-liners, here's the same method as above in a single expression, in a more SQL-like style:
df1.merge(
    df2.groupby('student_id').agg(
        no_days=('day', 'size'),
        total_clicks=('number_of_clicks', 'sum')
    ),
    on='student_id',
    how='left'
).fillna(0)
Use merge, fillna the null values, then aggregate using groupby.agg:
import numpy as np

df = (df1.merge(df2, how='left').fillna(0, downcast='infer')
         .groupby(['student_id', 'course'], as_index=False)
         .agg({'day': np.count_nonzero, 'number_of_clicks': np.sum}))
print(df)
   student_id course  day  number_of_clicks
0           1      a    3               181
1           2      b    4               321
2           3      c    1               103
3           4      a    0                 0
4           5      f    0                 0
5           6      f    0                 0
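A self-contained sketch of the named-aggregation answer, with an explicit cast back to int (the left join introduces NaN for students with no activity, which turns the aggregated columns into floats):

```python
import pandas as pd

df1 = pd.DataFrame({'student_id': [1, 2, 3, 4, 5, 6],
                    'course': ['a', 'b', 'c', 'a', 'f', 'f']})
df2 = pd.DataFrame({'student_id': [1, 1, 1, 2, 2, 2, 2, 3],
                    'day': [4, 5, 7, 3, 4, 8, 9, 2],
                    'number_of_clicks': [60, 34, 87, 33, 29, 213, 46, 103]})

agg = df2.groupby('student_id').agg(
    no_days=('day', 'size'),                   # rows per student; days are unique here
    total_clicks=('number_of_clicks', 'sum'),
)
out = df1.merge(agg, on='student_id', how='left').fillna(0)
out[['no_days', 'total_clicks']] = out[['no_days', 'total_clicks']].astype(int)
print(out)
```

If the same day can occur twice for one student, ('day', 'nunique') would count distinct days instead of rows.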

How could I replace null values in a group?

I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 with the difference to the next lower price in the same group?
for example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, nan, and I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data = [
    [1,'a',1,1,5], [2,'a',1,1,5], [3,'a',1,1,4], [4,'a',1,1,2],
    [5,'b',1,2,6], [6,'b',1,2,6], [7,'b',1,2,3]
]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
I think you can first remove duplicates across all the columns used in the groupby with diff, create the new column in the filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname', 'beds', 'baths', 'price', 'difference_price']], how='left')
print (df)
   id neighborhoodname  beds  baths  price  difference_price
0   1                a     1      1      5               1.0
1   2                a     1      1      5               1.0
2   3                a     1      1      4               2.0
3   4                a     1      1      2               NaN
4   5                b     1      2      6               3.0
5   6                b     1      2      6               3.0
6   7                b     1      2      3               NaN
Or you can use a lambda function that back-fills 0 values within each group, which avoids wrong output for one-row groups (data moved from another group):
df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
   id neighborhoodname  beds  baths  price  difference_price
0   1                a     1      1      5               1.0
1   2                a     1      1      5               1.0
2   3                a     1      1      4               2.0
3   4                a     1      1      2               NaN
4   5                b     1      2      6               3.0
5   6                b     1      2      6               3.0
6   7                b     1      2      3               NaN
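A runnable sketch of the lambda approach; group_keys=False is added here so the result aligns with the original index on recent pandas versions:

```python
import numpy as np
import pandas as pd

data = [[1, 'a', 1, 1, 5], [2, 'a', 1, 1, 5], [3, 'a', 1, 1, 4], [4, 'a', 1, 1, 2],
        [5, 'b', 1, 2, 6], [6, 'b', 1, 2, 6], [7, 'b', 1, 2, 3]]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])

# Within each group, diff against the next lower-priced row; equal prices
# produce a 0 gap, which is replaced by NaN and back-filled from the next real gap.
df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'],
                                     group_keys=False)['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print(df)
```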

Pandas - create total column based on other column

I'm trying to create a total column that sums the numbers from one column grouped by the values of another column. I can do this with .groupby(), but that produces a truncated column, whereas I want a column the same length as the original.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
   a  b  total
0  1  1    1.0
1  2  2    5.0
2  2  3   15.0
3  3  4    NaN
4  3  5    NaN
5  3  6    NaN
My desired result:
   a  b  total
0  1  1    1.0
1  2  2    5.0
2  2  3    5.0
3  3  4   15.0
4  3  5   15.0
5  3  6   15.0
...where every row of the same 'a' group shares that group's total.
Returning the sum from a groupby operation produces a column only as long as the number of unique values in the grouping column. Use transform to produce a like-indexed column (the same length as the original data frame) without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
   a  b  total
0  1  1      1
1  2  2      5
2  2  3      5
3  3  4     15
4  3  5     15
5  3  6     15
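For reference, a self-contained run of the transform approach:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})

# groupby(...).sum() would return one row per unique 'a';
# transform broadcasts each group's sum back to every member row
df['total'] = df.groupby('a')['b'].transform('sum')
print(df)
```

Because no merge or fillna is involved, the total column also keeps the integer dtype of b.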
