Calculate average revenue per user per month - python

I have the following dataframe:
Timestamp userid Prices_USD
0 2016-12-01 6.270941895 1.08
1 2016-12-01 6.609813209 1.12
2 2016-12-01 6.632094115 9.70
3 2016-12-01 6.655789772 1.08
4 2016-12-01 6.764640751 9.33
... ... ... ...
1183 2017-03-27 6.529604089 1.08
1184 2017-03-27 6.682639674 6.72
1185 2017-03-27 6.773815105 10.0
I want to calculate, for each unique userid, their monthly spending.
I've tried the following:
sales_per_user.set_index('Timestamp',inplace=True)
sales_per_user.index = pd.to_datetime(sales_per_user.index)
m = sales_per_user.index.month
monthly_avg = sales_per_user.groupby(['userid', m]).Prices_USD.mean().to_frame()
But the resulting dataframe is this:
userid Timestamp Prices_USD
3.43964843 12 10.91
3.885813375 1 10.91
2 10.91
12 21.82
However, the Timestamp column doesn't come out in the desired format. Ideally I would like:
userid Timestamp Prices_USD
3.43964843 2016-12 10.91
3.885813375 2017-01 10.91
2017-02 10.91
2016-12 21.82
How do I fix that?

Try:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
res = df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])['Prices_USD'].sum()
print(res)
Output
userid Timestamp
6.270942 2016-12 1.08
6.529604 2017-03 1.08
6.609813 2016-12 1.12
6.632094 2016-12 9.70
6.655790 2016-12 1.08
6.682640 2017-03 6.72
6.764641 2016-12 9.33
6.773815 2017-03 10.00
Name: Prices_USD, dtype: float64
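Note that this sums each user's spending per month; the question title asks for the average. A minimal variation (assuming the same df as above) swaps sum for mean, and reset_index() flattens the result back into a regular DataFrame:
monthly_avg = (
    df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])['Prices_USD']
      .mean()
      .reset_index()
)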

Getting NaN when using pandas groupby

I have a dataframe like so:
index date symbol stock_id open high low close volume vwap
0 0 2021-10-11 BVN 13 7.69 7.98 7.5600 7.61 879710 7.782174
1 1 2021-10-12 BVN 13 7.67 8.08 7.5803 8.02 794436 7.967061
2 2 2021-10-13 BVN 13 8.12 8.36 8.0900 8.16 716012 8.231286
3 3 2021-10-14 BVN 13 8.26 8.29 8.0500 8.28 586091 8.185899
4 4 2021-10-15 BVN 13 8.18 8.44 8.0600 8.44 1278409 8.284539
... ... ... ... ... ... ... ... ... ... ...
227774 227774 2022-10-04 ERIC 11000 6.27 6.32 6.2400 6.29 14655189 6.280157
227775 227775 2022-10-05 ERIC 11000 6.17 6.31 6.1500 6.29 10569193 6.219965
227776 227776 2022-10-06 ERIC 11000 6.20 6.25 6.1800 6.22 7918812 6.217198
227777 227777 2022-10-07 ERIC 11000 6.17 6.19 6.0800 6.10 9671252 6.135976
227778 227778 2022-10-10 ERIC 11000 6.13 6.15 6.0200 6.04 6310661 6.066256
[227779 rows x 10 columns]
And then a function to return a boolean mask on whether or not the df is consolidating inside of a range:
def is_consolidating(df, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(df['close']).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(df['close']).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return consolidation
Which I then call like:
df['t'] = df.groupby("stock_id").apply(is_consolidating)
The problem is when I print the df I am getting NaN for the values of my new column:
dan@danalgo:~/Documents/code/wolfhound$ python3 add_indicators_daily.py
index date symbol stock_id open high low close volume vwap t
0 0 2021-10-11 BVN 13 7.69 7.98 7.5600 7.61 879710 7.782174 NaN
1 1 2021-10-12 BVN 13 7.67 8.08 7.5803 8.02 794436 7.967061 NaN
2 2 2021-10-13 BVN 13 8.12 8.36 8.0900 8.16 716012 8.231286 NaN
3 3 2021-10-14 BVN 13 8.26 8.29 8.0500 8.28 586091 8.185899 NaN
4 4 2021-10-15 BVN 13 8.18 8.44 8.0600 8.44 1278409 8.284539 NaN
... ... ... ... ... ... ... ... ... ... ... ...
227774 227774 2022-10-04 ERIC 11000 6.27 6.32 6.2400 6.29 14655189 6.280157 NaN
227775 227775 2022-10-05 ERIC 11000 6.17 6.31 6.1500 6.29 10569193 6.219965 NaN
227776 227776 2022-10-06 ERIC 11000 6.20 6.25 6.1800 6.22 7918812 6.217198 NaN
227777 227777 2022-10-07 ERIC 11000 6.17 6.19 6.0800 6.10 9671252 6.135976 NaN
227778 227778 2022-10-10 ERIC 11000 6.13 6.15 6.0200 6.04 6310661 6.066256 NaN
[227779 rows x 11 columns]
Full code:
import pandas as pd
from IPython.display import display
import sqlite3 as sql
import numpy as np
conn = sql.connect('allStockData.db')
# get everything inside daily_ohlc and add to a dataframe
df = pd.read_sql_query("SELECT * from daily_ohlc_init", conn)
def is_consolidating(df, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(df['close']).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(df['close']).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return consolidation
df['t'] = df.groupby("stock_id").apply(is_consolidating)
print(df)
df.to_sql('daily_ohlc_init_with_indicators', if_exists='replace', con=conn, index=True)
You could do it like this. The original version fails because apply hands each group to a function that returns a bare NumPy array, so the combined result is indexed by the group keys and pandas cannot align it with df's rows (hence the NaNs); returning a Series that carries the group's own index restores the alignment:
def is_consolidating(grp, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(grp).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(grp).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return pd.Series(consolidation, index=grp.index)
df['t'] = df.groupby("stock_id")['close'].apply(is_consolidating)
print(df)
Output (part of it):
volume vwap t
0 879710 7.782174 False
1 794436 7.967061 False
2 716012 8.231286 True
3 586091 8.185899 True
4 1278409 8.284539 True
227774 14655189 6.280157 False
227775 10569193 6.219965 True
227776 7918812 6.217198 True
227777 9671252 6.135976 True
227778 6310661 6.066256 True
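If you'd rather skip the helper function entirely, the same result can be had with transform, which always returns values aligned to the original index; a sketch with the same window and threshold as above (NaN comparisons evaluate to False, matching the np.where version):
g = df.groupby('stock_id')['close']
rolling_min = g.transform(lambda s: s.rolling(2, min_periods=2).min())
rolling_max = g.transform(lambda s: s.rolling(2, min_periods=2).max())
df['t'] = (rolling_min / rolling_max) >= 0.95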

pandas - groupby multiple values?

I have a dataframe that contains cell phone minutes usage, logged by date of call and duration.
It looks like this (30-row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for Jan with duration summed, all calls for Feb with duration summed, etc.
I am really new to Python and programming in general, and am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month as you seem to suggest you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
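If you'd rather see month labels like 2018-12 instead of the month-end dates the Grouper produces, grouping on a monthly period works too (assuming call_date has been converted with pd.to_datetime as above):
df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()
The sums are identical; only the index labels change (2018-12 rather than 2018-12-31).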
Let me know if you are getting different results from following these steps.

Pandas: How to remove the index column after groupby and unstack?

I'm having trouble removing the index column in pandas after a groupby and unstack of a DataFrame.
My original DataFrame looks like this:
example = pd.DataFrame({'date': ['2016-12', '2016-12', '2017-01', '2017-01', '2017-02', '2017-02', '2017-02'], 'customer': [123, 456, 123, 456, 123, 456, 456], 'sales': [10.5, 25.2, 6.8, 23.4, 29.5, 23.5, 10.4]})
example.head(10)
output:
      date  customer  sales
0  2016-12       123   10.5
1  2016-12       456   25.2
2  2017-01       123    6.8
3  2017-01       456   23.4
4  2017-02       123   29.5
5  2017-02       456   23.5
6  2017-02       456   10.4
Note that it's possible to have multiple sales for one customer per month (as in rows 5 and 6).
My aim is to convert the DataFrame into an aggregated DataFrame like this:
customer  2016-12  2017-01  2017-02
123          10.5      6.8     29.5
456          25.2     23.4     33.9
My solution so far:
example = example[['date', 'customer', 'sales']].groupby(['date', 'customer']).sum().unstack('date')
example.head(10)
output:
           sales
date     2016-12 2017-01 2017-02
customer
123         10.5     6.8    29.5
456         25.2    23.4    33.9
example = example['sales'].reset_index(level=[0])
example.head(10)
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
At this point I'm unable to remove the "date" column:
example.reset_index(drop = True)
example.head()
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
It just stays the same. Have you got any ideas?
An alternative to your solution, where the key is just to add rename_axis(columns=None), since "date" is the name of the columns axis:
(example[["date", "customer", "sales"]]
.groupby(["date", "customer"])
.sum()
.unstack("date")
.droplevel(0, axis="columns")
.rename_axis(columns=None)
.reset_index())
customer 2016-12 2017-01 2017-02
0 123 10.5 6.8 29.5
1 456 25.2 23.4 33.9
Why not directly go with pivot_table?
(example
.pivot_table('sales', index='customer', columns="date", aggfunc='sum')
.rename_axis(columns=None).reset_index())
customer 2016-12 2017-01 2017-02
0 123 10.5 6.8 29.5
1 456 25.2 23.4 33.9
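For completeness, the asker's own attempt was one step away: after example['sales'].reset_index(level=[0]), the leftover "date" is the name of the columns axis rather than a column, which is why reset_index(drop=True) can't remove it. A minimal in-place fix (assuming example is the frame from that step):
example.columns.name = None  # clears the leftover "date" label over the columns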

Adding a row into DataFrame with multiindex

I would like to create a new DataFrame and add a bunch of stock data for each date.
I'm declaring a DataFrame with a multi-index: date and stock ticker.
Adding data for 2020-06-07
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
Adding data for 2020-06-08
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
2020-06-08 AAPL 32.50 34.20 31.1 32.30
2020-06-08 MSFT 58.50 59.20 52.1 53.30
What would be the best and most efficient solution?
Here's my current version that doesn't work as I expect.
df = pd.DataFrame()
for date in dates:
    universe500 = get_universe(date)  # returns stocks on a specific date
    for security in universe500:
        prices = data.get_prices(security, ['open','high','low','close'], 1, '1d')  # returns pd.DataFrame
        df.iloc[(date, security),:] = prices
If prices is a DataFrame formatted in the same manner as the original df, you can use concat:
In[0]:
#constructing a fake entry
arrays = [['2020-06-09'], ['ABCD']]
multi = pd.MultiIndex.from_arrays(arrays, names=('date', 'stock'))
to_add = pd.DataFrame({'open':1, 'high':2, 'low':3, 'close':4}, index=multi)
print(to_add)
Out[0]:
open high low close
date stock
2020-06-09 ABCD 1 2 3 4
In[1]:
#now adding to your data
df = pd.concat([df, to_add])
print(df)
Out[1]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
If the data (prices) were just an array of 4 numbers (the open, high, low, and close values), then loc would work in place of the iloc you used:
In[2]:
df.loc[('2020-06-10','WXYZ'),:] = [10,20,30,40]
Out[2]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
2020-06-10 WXYZ 10.0 20.0 30.0 40.0
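One efficiency note on the loop in the question: concatenating (or assigning rows) inside a loop copies the whole frame on every iteration. The usual pattern is to collect the per-security frames in a list and call concat once; a sketch reusing the asker's get_universe and data.get_prices, whose signatures are assumed from the question:
frames = []
for date in dates:
    for security in get_universe(date):
        # each call is assumed to return a DataFrame already indexed by (date, stock)
        frames.append(data.get_prices(security, ['open', 'high', 'low', 'close'], 1, '1d'))
df = pd.concat(frames)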

How to assign values to a dataframe column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the TIMESTAMP and DATE columns are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a RANK column to df, where each row gets the rank of its TIMESTAMP's day from ranked (so all the 5-minute timesteps within a day share the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE']==row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values
df = df.set_index(df.TIMESTAMP.dt.date)\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
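An alternative sketch that avoids re-indexing twice: build a day-to-rank Series from ranked and map it onto the normalized timestamps. dt.normalize() keeps the datetime64 dtype, so the lookup keys match ranked['DATE'] exactly:
day_rank = ranked.set_index('DATE')['RANK']  # Series mapping date -> rank
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(day_rank)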
