Pandas: How to remove the index column after groupby and unstack?

I'm having trouble removing the index column in pandas after a groupby and unstack of a DataFrame.
My original DataFrame looks like this:
example = pd.DataFrame({'date': ['2016-12', '2016-12', '2017-01', '2017-01', '2017-02', '2017-02', '2017-02'], 'customer': [123, 456, 123, 456, 123, 456, 456], 'sales': [10.5, 25.2, 6.8, 23.4, 29.5, 23.5, 10.4]})
example.head(10)
output:
      date  customer  sales
0  2016-12       123   10.5
1  2016-12       456   25.2
2  2017-01       123    6.8
3  2017-01       456   23.4
4  2017-02       123   29.5
5  2017-02       456   23.5
6  2017-02       456   10.4
Note that it's possible to have multiple sales for one customer per month (as in rows 5 and 6).
My aim is to convert the DataFrame into an aggregated DataFrame like this:
customer  2016-12  2017-01  2017-02
     123     10.5      6.8     29.5
     456     25.2     23.4     33.9
My solution so far:
example = example[['date', 'customer', 'sales']].groupby(['date', 'customer']).sum().unstack('date')
example.head(10)
output:
           sales
date     2016-12  2017-01  2017-02
customer
123         10.5      6.8     29.5
456         25.2     23.4     33.9
example = example['sales'].reset_index(level=[0])
example.head(10)
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
At this point I'm unable to remove the "date" column:
example.reset_index(drop = True)
example.head()
output:
date  customer  2016-12  2017-01  2017-02
0          123     10.5      6.8     29.5
1          456     25.2     23.4     33.9
It just stays the same. Have you got any ideas?

An alternative to your solution; the key is just to add rename_axis(columns=None), since "date" is the name of the columns axis:
(example[["date", "customer", "sales"]]
.groupby(["date", "customer"])
.sum()
.unstack("date")
.droplevel(0, axis="columns")
.rename_axis(columns=None)
.reset_index())
   customer  2016-12  2017-01  2017-02
0       123     10.5      6.8     29.5
1       456     25.2     23.4     33.9
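As a side note, applied to the frame the question ends up with (the one after reset_index), the same idea is just clearing the columns-axis name; reset_index(drop=True) cannot remove it because "date" is not an index level. A minimal sketch:
# Sketch: clear the columns-axis name on the last frame from the question.
# reset_index(drop=True) has no effect here because "date" is the name of
# the columns axis, not an index level.
example = example.rename_axis(columns=None)
print(example.head())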

Why not directly go with pivot_table?
(example
.pivot_table('sales', index='customer', columns="date", aggfunc='sum')
.rename_axis(columns=None).reset_index())
   customer  2016-12  2017-01  2017-02
0       123     10.5      6.8     29.5
1       456     25.2     23.4     33.9
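For what it's worth, a quick sanity check (a sketch using the original example frame from the top of the question) that both routes give the same table once the axis names are cleared:
# Both approaches should produce the same result on the original example frame.
via_groupby = (example.groupby(['date', 'customer'])['sales'].sum()
               .unstack('date')
               .rename_axis(columns=None)
               .reset_index())
via_pivot = (example.pivot_table('sales', index='customer', columns='date',
                                 aggfunc='sum')
             .rename_axis(columns=None)
             .reset_index())
assert via_groupby.equals(via_pivot)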

Related

How to merge 2 dataframes by aligning the values in 2 columns?

I have the following dataframe df to process:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 211 1.13 122 2.53
2 A 242 1.22 211 1.13
3 A 245 3.87 242 1.38
4 A 311 3.13 243 4.00
5 A 312 7.11 311 2.07
6 A NaN NaN 312 7.11
7 A NaN NaN 324 1.06
As you can see, the two code columns, C1 and C2, are not aligned on the same rows: codes 122, 243, and 324 (in column C2) do not appear in column C1, and code 245 (in column C1) does not appear in column C2.
I would like to reconstruct a file where the codes are aligned according to their value, so as to obtain this:
Name C1 Value_1 C2 Value_2
0 A 112 2.36 112 3.77
1 A 122 NaN 122 2.53
2 A 211 1.13 211 1.13
3 A 242 1.22 242 1.38
4 A 243 NaN 243 4.00
5 A 245 3.87 245 NaN
6 A 311 3.13 311 2.07
7 A 312 7.11 312 7.11
8 A 324 NaN 324 1.06
In order to do so, I thought of creating 2 subsets:
left = df[['Name', 'C1', 'Value_1']]
right = df[['Name', 'C2', 'Value_2']]
and I tried to merge them using the merge function:
left.merge(right, on=..., how=..., suffixes=...)
but I got lost in the parameters that should be used to achieve the result.
What do you think would be the best way to do it?
Appendix:
In order to create the initial dataframe, one could use:
names = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
code1 = [112,211,242,245,311,312,np.nan,np.nan]
zone1 = [2.36, 1.13, 1.22, 3.87, 3.13, 7.11, np.nan, np.nan]
code2 = [112,122,211,242,243,311,312,324]
zone2 = [3.77, 2.53, 1.13, 1.38, 4.00, 2.07, 7.11, 1.06]
df = pd.DataFrame({'Name': names, 'C1': code1, 'Value_1': zone1, 'C2': code2, 'Value_2': zone2})
You are almost there:
left.merge(right, right_on = "C2", left_on = "C1", how="right").fillna(0)
Output
Name_x  C1   Value_1  Name_y  C2   Value_2
A       112  2.36     A       112  3.77
0       0    0        A       122  2.53
A       211  1.13     A       211  1.13
A       242  1.22     A       242  1.38
0       0    0        A       243  4
A       311  3.13     A       311  2.07
A       312  7.11     A       312  7.11
0       0    0        A       324  1.06
IIUC, you can perform an outer merge, then drop the rows where both codes are missing:
(df[['Name', 'C1', 'Value_1']]
 .merge(df[['Name', 'C2', 'Value_2']],
        left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
 .dropna(subset=['C1', 'C2'], how='all')
)
output:
Name C1 Value_1 C2 Value_2
0 A 112.0 2.36 112.0 3.77
1 A 211.0 1.13 211.0 1.13
2 A 242.0 1.22 242.0 1.38
3 A 245.0 3.87 NaN NaN
4 A 311.0 3.13 311.0 2.07
5 A 312.0 7.11 312.0 7.11
8 A NaN NaN 122.0 2.53
9 A NaN NaN 243.0 4.00
10 A NaN NaN 324.0 1.06
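If you also want the code columns themselves aligned as in the desired output (one code per row, no NaN left in C1 or C2), a hedged follow-up sketch, filling each code column from the other and sorting by code:
# Follow-up sketch (not part of the answer above): fill each code column from
# the other one, then sort by code to match the desired layout. Assumes df as
# built in the question's appendix.
merged = (df[['Name', 'C1', 'Value_1']]
          .merge(df[['Name', 'C2', 'Value_2']],
                 left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
          .dropna(subset=['C1', 'C2'], how='all'))
merged['C1'] = merged['C1'].fillna(merged['C2'])
merged['C2'] = merged['C2'].fillna(merged['C1'])
merged = merged.sort_values('C1').reset_index(drop=True)
print(merged)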

Calculate average revenue per user per month

I have the following dataframe:
Timestamp userid Prices_USD
0 2016-12-01 6.270941895 1.08
1 2016-12-01 6.609813209 1.12
2 2016-12-01 6.632094115 9.70
3 2016-12-01 6.655789772 1.08
4 2016-12-01 6.764640751 9.33
... ... ... ...
1183 2017-03-27 6.529604089 1.08
1184 2017-03-27 6.682639674 6.72
1185 2017-03-27 6.773815105 10.0
I want to calculate, for each unique userid, their monthly spending.
I've tried the following:
sales_per_user.set_index('Timestamp',inplace=True)
sales_per_user.index = pd.to_datetime(sales_per_user.index)
m = sales_per_user.index.month
monthly_avg = sales_per_user.groupby(['userid', m]).Prices_USD.mean().to_frame()
But the resulting dataframe is this:
userid       Timestamp  Prices_USD
3.43964843   12              10.91
3.885813375  1               10.91
             2               10.91
             12              21.82
However, the timestamp column doesn't have the desired outcome. Ideally I would like
userid       Timestamp  Prices_USD
3.43964843   2016-12         10.91
3.885813375  2017-01         10.91
             2017-02         10.91
             2016-12         21.82
How do I fix that?
Try:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
res = df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])['Prices_USD'].sum()
print(res)
Output
userid    Timestamp
6.270942  2016-12       1.08
6.529604  2017-03       1.08
6.609813  2016-12       1.12
6.632094  2016-12       9.70
6.655790  2016-12       1.08
6.682640  2017-03       6.72
6.764641  2016-12       9.33
6.773815  2017-03      10.00
Name: Prices_USD, dtype: float64
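If you want the per-month average rather than the total (as in the question's mean() attempt), the same grouping works; a sketch along the same lines:
# Sketch: same grouping, but with mean() for the average per user per month,
# and reset_index() to get a flat DataFrame instead of a MultiIndex Series.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
monthly_avg = (df.groupby([df['userid'], df['Timestamp'].dt.to_period('M')])
               ['Prices_USD'].mean()
               .reset_index())
print(monthly_avg)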

pandas ffill() with groupby

I have the following dataframe with 22 columns:
ID S0 S1 S2 .....
ABC 10.4 5.58
ABC 12.6
ABC 8.45
LMN 5.6
LMN 8.7
I have to ffill() the values based on groups. Intended result:
ID SS RR S2 ...
ABC 10.4 5.58
ABC 12.6 10.4 5.58
ABC 12.6 10.4 8.45
LMN 5.6
LMN 8.7 5.6
I am using the following code to get S0,S1... values:
df[['Resistance', 'cumcountR']].pivot(columns='cumcountR').droplevel(0, axis=1).add_prefix('R').drop(columns='R-1.0').ffill()
Any help will be appreciated. Thanks!
Try with GroupBy.ffill
out = df.groupby('ID').ffill()
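Note that groupby('ID').ffill() returns only the filled value columns, so you may want to re-attach the grouping column. A small sketch on assumed data (the exact column layout in the post is ambiguous):
import pandas as pd
import numpy as np

# Assumed layout of the question's frame (the post's spacing is ambiguous).
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'ABC', 'LMN', 'LMN'],
    'S0': [10.4, np.nan, np.nan, 5.6, np.nan],
    'S1': [5.58, 12.6, np.nan, np.nan, 8.7],
    'S2': [np.nan, np.nan, 8.45, np.nan, np.nan],
})

out = df.groupby('ID').ffill()   # forward-fill within each ID group
out.insert(0, 'ID', df['ID'])    # re-attach the grouping column
print(out)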

Adding a row into DataFrame with multiindex

I would like to create a new DataFrame and add a bunch of stock data for each date.
Declaring a DataFrame with a multi-index - date and stock ticker.
Adding data for 2020-06-07
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
Adding data for 2020-06-08
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
2020-06-08 AAPL 32.50 34.20 31.1 32.30
2020-06-08 MSFT 58.50 59.20 52.1 53.30
What would be the best and most efficient solution?
Here's my current version, which doesn't work as I expect:
df = pd.DataFrame()
for date in dates:
    universe500 = get_universe(date)  # returns stocks on a specific date
    for security in universe500:
        prices = data.get_prices(security, ['open', 'high', 'low', 'close'], 1, '1d')  # returns pd.DataFrame
        df.iloc[(date, security), :] = prices
If prices is a DataFrame formatted in the same manner as the original df, you can use concat:
In[0]:
# constructing a fake entry
arrays = [['2020-06-09'], ['ABCD']]
multi = pd.MultiIndex.from_arrays(arrays, names=('date', 'stock'))
to_add = pd.DataFrame({'open': 1, 'high': 2, 'low': 3, 'close': 4}, index=multi)
print(to_add)
Out[0]:
open high low close
date stock
2020-06-09 ABCD 1 2 3 4
In[1]:
#now adding to your data
df = pd.concat([df, to_add])
print(df)
Out[1]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
If the data (prices) were just an array of 4 numbers (the open, high, low, and close values), then loc would work in place of the iloc you used:
In[2]:
df.loc[('2020-06-10','WXYZ'),:] = [10,20,30,40]
Out[2]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
2020-06-10 WXYZ 10.0 20.0 30.0 40.0
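On the efficiency question: growing a DataFrame row by row inside the loop is the slow part, and collecting the pieces and concatenating once is usually preferable. A rough sketch, assuming the question's get_universe and data.get_prices helpers behave as described (each get_prices call returning a single-row DataFrame):
# Rough sketch (assumes get_universe / data.get_prices from the question,
# and that each get_prices call returns a one-row DataFrame).
frames = []
for date in dates:
    for security in get_universe(date):
        prices = data.get_prices(security, ['open', 'high', 'low', 'close'], 1, '1d')
        prices.index = pd.MultiIndex.from_tuples([(date, security)],
                                                 names=('date', 'stock'))
        frames.append(prices)
df = pd.concat(frames)  # one concat instead of many incremental inserts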

How to filter the column of one DataFrame by the value of another DataFrame

Here is my df1, which has 365 values for the 365 days of a year:
Date Data_Value
0 01-01 33.3
1 01-02 30.6
2 01-03 31.1
3 01-04 30.0
4 01-05 30.0
5 01-06 31.1
6 01-07 31.1
And here is my df2, which has 7262 rows of data, with several rows for each date of the year:
mnth_day Data_Value
0 10-07 30.0
1 10-18 21.1
2 02-25 14.4
3 03-22 28.9
4 02-25 28.9
5 09-03 24.4
6 11-19 25.0
I need to check on how many days the 'Data_Value' in df2 exceeds the 'Data_Value' in df1. How can I do that?
My expected output will look like:
Date Data_Value
0 01-01 33.3
1 01-02 30.6
2 01-03 31.1
3 01-04 30.0
4 01-05 30.0
5 01-06 31.1
6 01-07 31.1
Here the 'Date' column should hold the dates on which the 'Data_Value' in df2 exceeds the 'Data_Value' in df1, and the 'Data_Value' column should hold the largest such 'Data_Value'.
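A hedged sketch of one way to approach this (not an answer from the original page): merge the two frames on the month-day key and keep the rows where df2's value is higher.
# Sketch: align df2 with df1 on the month-day key, keep rows where df2's value
# exceeds df1's, then take the largest exceeding value per date.
merged = df2.merge(df1, left_on='mnth_day', right_on='Date',
                   suffixes=('_df2', '_df1'))
exceeds = merged[merged['Data_Value_df2'] > merged['Data_Value_df1']]
result = (exceeds.groupby('Date')['Data_Value_df2'].max()
          .reset_index()
          .rename(columns={'Data_Value_df2': 'Data_Value'}))
print(len(exceeds), 'rows in df2 exceed df1;', len(result), 'distinct dates')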
