How to sum a pandas dataframe column for values greater than 5 - python

In my Days column, when the value is 5 or more, I want to sum those rows' files values into a single row.
Input:
Days files
1 10
3 20
4 30
5 40
6 50
Required output:
Days files
1 10
3 20
4 30
5+ 90
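For reference, a minimal reconstruction of the sample input as a DataFrame (column names taken from the question), which the answers below assume:
import pandas as pd

df = pd.DataFrame({'Days': [1, 3, 4, 5, 6],
                   'files': [10, 20, 30, 40, 50]})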

You can use Series.clip to cap the upper bound of the series, then do a groupby and sum:
out = df['files'].groupby(df['Days'].clip(upper=5)).sum().reset_index()
print(out)
Days files
0 1 10
1 3 20
2 4 30
3 5 90
If you really want to change anything above 5 into a str type, you can just replace the 5 using the above logic:
out = df['files'].groupby(df['Days'].clip(upper=5).replace(5,'5+')).sum().reset_index()
print(out)
Days files
0 1 10
1 3 20
2 4 30
3 5+ 90
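As an alternative sketch on the same df, numpy.where can build the string labels directly before grouping:
import numpy as np

# Label rows '5+' when Days >= 5, otherwise keep the day as a string
labels = np.where(df['Days'] >= 5, '5+', df['Days'].astype(str))
print(df.groupby(labels)['files'].sum())
# 1     10
# 3     20
# 4     30
# 5+    90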

I'd do it like this:
above = df.files[df.Days >= 5].sum()
# DataFrame.append was removed in pandas 2.0, so build the extra row and concat:
pd.concat([df[df.Days < 5], pd.DataFrame([{'Days': '5+', 'files': above}])], ignore_index=True)
It gives:
Days files
0 1 10
1 3 20
2 4 30
3 5+ 90

Related

Check if value in a dataframe is between two values in another dataframe

I have a pretty similar question to another question on here.
Let's assume I have two dataframes:
df
volume
11
24
30
df2
range_low range_high price
10 20 1
21 30 2
How can I filter the second dataframe based on a value from the first dataframe, keeping only the rows whose range contains that value?
So for example (value 11 from df) leads to:
df3
range_low range_high price
10 20 1
whereas (value 30 from df) leads to an empty df3.
I am looking for a way to check if a specific value is in a range of another dataframe, and to filter that dataframe based on this condition. In pseudocode:
Find 11 in
(10, 20), if True: df3 = filter on this row
(21, 30), if True: df3= filter on this row
if not
return empty frame
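For reproducibility, the sample frames as DataFrames (a minimal reconstruction of the tables above):
import pandas as pd

df = pd.DataFrame({'volume': [11, 24, 30]})
df2 = pd.DataFrame({'range_low': [10, 21],
                    'range_high': [20, 30],
                    'price': [1, 2]})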
A for-loop solution (note that df3 is overwritten on each iteration, so only the last match survives after the loop):
for v in df['volume']:
    df3 = df2[(df2['range_low'] < v) & (df2['range_high'] > v)]
    print(df3)
For a non-loop solution you can use a cross join, but with large DataFrames this may run into memory problems:
df = df.assign(a=1).merge(df2.assign(a=1), on='a', how='outer')
print (df)
volume a range_low range_high price
0 11 1 10 20 1
1 11 1 21 30 2
2 24 1 10 20 1
3 24 1 21 30 2
4 30 1 10 20 1
5 30 1 21 30 2
df3 = df[(df['range_low'] < df['volume']) & (df['range_high'] > df['volume'])]
print (df3)
volume a range_low range_high price
0 11 1 10 20 1
3 24 1 21 30 2
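Since pandas 1.2 the dummy-column trick can be replaced by a native cross join; a minimal sketch of the same filter, starting again from the original df and df2:
# Native cross join (pandas >= 1.2), then the same range filter
merged = df.merge(df2, how='cross')
df3 = merged[(merged['range_low'] < merged['volume']) & (merged['range_high'] > merged['volume'])]
print(df3)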
I have a similar problem (but with date ranges), and if df2 is too large, it will take forever.
If the volumes are always integers, a faster solution is to create an intermediate dataframe that associates each possible volume with a price (in one pass), and then merge.
price_list = []
for index, row in df2.iterrows():
    x = pd.DataFrame(range(row['range_low'], row['range_high'] + 1), columns=['volume'])
    x['price'] = row['price']
    price_list.append(x)
df_prices = pd.concat(price_list)
You will get something like this:
volume price
0 10 1
1 11 1
2 12 1
3 13 1
4 14 1
5 15 1
6 16 1
7 17 1
8 18 1
9 19 1
10 20 1
0 21 2
1 22 2
2 23 2
3 24 2
4 25 2
5 26 2
6 27 2
7 28 2
8 29 2
9 30 2
Then you can quickly associate a price with each volume in df:
df.merge(df_prices,on='volume')
volume price
0 11 1
1 24 2
2 30 2
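Another option, sketched here under the assumption that the ranges do not overlap and both bounds are inclusive (as in the expanded-range answer above), is an IntervalIndex lookup, which avoids materialising every possible volume:
import pandas as pd

# Build an interval per row of df2, then locate each volume's interval
intervals = pd.IntervalIndex.from_arrays(df2['range_low'], df2['range_high'], closed='both')
pos = intervals.get_indexer(df['volume'])  # -1 where no interval matches
hits = pos != -1
df3 = df[hits].assign(price=df2['price'].to_numpy()[pos[hits]])
print(df3)
#    volume  price
# 0      11      1
# 1      24      2
# 2      30      2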

Calculate difference sequentially by groups in pandas

I'm trying to calculate the difference between two columns sequentially as efficiently as possible. My DataFrame looks like this:
category sales initial_stock
1 2 20
1 6 20
1 1 20
2 4 30
2 6 30
2 5 30
2 7 30
And I want to calculate a variable final_stock, like this:
category sales initial_stock final_stock
1 2 20 18
1 6 20 12
1 1 20 11
2 4 30 26
2 6 30 20
2 5 30 15
2 7 30 8
Thus, final_stock first equals initial_stock - sales, and then it equals final_stock.shift() - sales, for each category. I managed to do this with for loops, but it is quite slow, and my feeling says there is probably a one- or two-liner solution to this problem. Do you have any ideas?
Thanks
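For reference, the sample data as a DataFrame (a minimal reconstruction of the table above):
import pandas as pd

df = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 2],
                   'sales': [2, 6, 1, 4, 6, 5, 7],
                   'initial_stock': [20, 20, 20, 30, 30, 30, 30]})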
Use groupby and cumsum on "sales" to get the cumulative stock sold per category, then subtract from "initial_stock":
df['final_stock'] = df['initial_stock'] - df.groupby('category')['sales'].cumsum()
df
category sales initial_stock final_stock
0 1 2 20 18
1 1 6 20 12
2 1 1 20 11
3 2 4 30 26
4 2 6 30 20
5 2 5 30 15
6 2 7 30 8
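To see why this works: within each category, subtracting the running total of sales from the fixed initial stock is exactly the sequential subtraction the question describes. A quick loop version for comparison (a sketch for verification only, using the df above):
final = []
for _, g in df.groupby('category'):
    stock = g['initial_stock'].iloc[0]
    for s in g['sales']:
        stock -= s
        final.append(stock)
print(final)  # [18, 12, 11, 26, 20, 15, 8]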

Get longest streak of consecutive weeks by group in pandas

Currently I'm working with weekly data for different subjects, but the data may have long gaps, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got close by trying to mark a row with 1 when week == week.shift() + 1. The problem is that this approach doesn't mark the first occurrence in a streak, and I also can't filter for the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
This, according to my example, would produce:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
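To unpack the grouper in the one-liner above, here is a step-by-step sketch on the question's data:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'week': [8, 15, 60, 61, 62, 10, 11, 12, 13, 25, 26]})

# week.diff(-1) compares each week with the next row; .ne(-1) flags streak breaks
breaks = df['week'].diff(-1).ne(-1)
# shift/bfill moves each flag onto the row that starts a new streak,
# and cumsum turns the flags into a running streak label
streak_id = breaks.shift().bfill().cumsum()
print(streak_id.tolist())  # [1, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5]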
Not as concise as @ScottBoston's answer, but I like this approach:
import numpy as np
import pandas as pd

def max_streak(s):
    a = s.values  # let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner.
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # The locations of the breaks between streaks
    c = np.flatnonzero(b)
    # `diff` again gives the length of each streak
    d = np.diff(c)
    # `argmax` gives the position of the longest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([make_thing(g) for _, g in df.groupby('id')])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4

How to sort Pandas DataFrame both by MultiIndex and by value?

Sample data:
mdf = pd.DataFrame([[1,2,50],[1,2,20],
[1,5,10],[2,8,80],
[2,5,65],[2,8,10]
], columns=['src','dst','n']); mdf
src dst n
0 1 2 50
1 1 2 20
2 1 5 10
3 2 8 80
4 2 5 65
5 2 8 10
groupby() gives a two-level multi-index:
test = mdf.groupby(['src','dst'])['n'].agg(['sum','count']); test
sum count
src dst
1 2 70 2
5 10 1
2 5 65 1
8 90 2
Question: how to sort this DataFrame by src ascending and then by sum descending?
I'm a beginner with pandas; I've learned about sort_index() and sort_values(), but in this task it seems that I need both simultaneously.
Expected result: under each "src", the sorting is determined by "sum":
sum count
src dst
1 2 70 2
5 10 1
2 8 90 2
5 65 1
In case anyone else comes across this via Google: since pandas 0.23, you can pass the name of an index level as an argument to sort_values:
test.sort_values(['src','sum'], ascending=[1,0])
Result:
sum count
src dst
1 2 70 2
5 10 1
2 8 90 2
5 65 1
IIUC:
In [29]: test.sort_values('sum', ascending=False).sort_index(level=0, sort_remaining=False, kind='stable')
Out[29]:
sum count
src dst
1 2 70 2
5 10 1
2 8 90 2
5 65 1
Note that this relies on the second sort leaving the within-group order from the first sort intact, hence sort_remaining=False and kind='stable'.
UPDATE: very similar to @anonyXmous's solution:
In [47]: (test.reset_index()
.sort_values(['src','sum'], ascending=[1,0])
.set_index(['src','dst']))
Out[47]:
sum count
src dst
1 2 70 2
5 10 1
2 8 90 2
5 65 1
You can reset the index and then sort by the chosen columns. Hope this helps.
import pandas as pd
mdf = pd.DataFrame([[1,2,50],[1,2,20],
[1,5,10],[2,8,80],
[2,5,65],[2,8,10]
], columns=['src','dst','n']);
mdf = mdf.groupby(['src','dst'])['n'].agg(['sum','count']);
mdf.reset_index(inplace=True)
mdf.sort_values(['src', 'sum'], ascending=[True, False], inplace=True)
print(mdf)
Result:
src dst sum count
0 1 2 70 2
1 1 5 10 1
3 2 8 90 2
2 2 5 65 1

pandas get average of a groupby

I am trying to find the average monthly cost per user_id, but I am only able to get either the average cost per user or the monthly cost per user.
Because I group by user and month, there is no way to get the average over the second groupby level (month) unless I transform the groupby output into something else.
This is my df:
df = pd.DataFrame({'id': pd.Series([1,1,1,1,2,2,2,2]),
                   'cost': pd.Series([10,20,30,40,50,60,70,80]),
                   'mth': pd.Series([3,3,4,5,3,4,4,5])})
cost id mth
0 10 1 3
1 20 1 3
2 30 1 4
3 40 1 5
4 50 2 3
5 60 2 4
6 70 2 4
7 80 2 5
I can get the monthly sum, but I want the average of the months for each user_id.
df.groupby(['id','mth'])['cost'].sum()
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
I want something like this:
id average_monthly
1 (30+30+40)/3
2 (50+130+80)/3
Resetting the index should work. Try this:
In [19]: df.groupby(['id', 'mth']).sum().reset_index().groupby('id').mean()
Out[19]:
mth cost
id
1 4.0 33.333333
2 4.0 86.666667
You can just drop mth if you want. The logic is that after the sum part, you have this:
In [20]: df.groupby(['id', 'mth']).sum()
Out[20]:
cost
id mth
1 3 30
4 30
5 40
2 3 50
4 130
5 80
Resetting the index at this point will give you unique months.
In [21]: df.groupby(['id', 'mth']).sum().reset_index()
Out[21]:
id mth cost
0 1 3 30
1 1 4 30
2 1 5 40
3 2 3 50
4 2 4 130
5 2 5 80
It's just a matter of grouping it again, this time using mean instead of sum. This should give you the averages.
Let us know if this helps.
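An equivalent sketch without reset_index: after the sum, group the resulting Series by its first index level directly:
avg = df.groupby(['id', 'mth'])['cost'].sum().groupby(level='id').mean()
print(avg)
# id
# 1    33.333333
# 2    86.666667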
The same two-step pattern with differently named columns; note that the second groupby must be on a key column such as the year, not on "Revenue" itself:
df_monthly_average = (
    df.groupby(["InvoiceYear", "InvoiceMonth"])["Revenue"]
    .sum()
    .reset_index()
    .groupby("InvoiceYear")["Revenue"]  # average the monthly sums per year
    .mean()
    .reset_index()
)
