How to access column in groupby? - Pandas - python

I have the following pandas DataFrame dt:
auftragskennung sku artikel_bezeichnung summen_netto system_created
0 14 200182 Product 1 -16.64 2015-05-12 19:55:16
1 14 730293 Product 2 -4.16 2015-05-12 19:55:16
2 3 720933 Product 3 0.00 2014-03-25 12:12:44
3 3 192042 Product 4 19.95 2014-03-25 12:12:45
4 3 423902 Product 5 23.88 2014-03-25 12:12:45
I then execute this command to get the best-selling products, grouped by sku:
topseller = dt.groupby("sku").agg({"summen_netto": np.sum}).sort("summen_netto", ascending=False)
Which returns something like:
summen_netto
sku
730293 55622.24
720933 35603.99
192042 27698.99
423902 26726.28
734630 25730.21
740353 22798.14
This is what I want, but how can I now access the sku column? topseller["sku"] does not work. It always gives me a KeyError.
I would like to be able to do this:
topseller["sku"]["730293"]
Which would then return 55622.24

The sku is now the index, so you need to use loc to perform label selection:
In [7]:
topseller.loc[730293]
Out[7]:
summen_netto -4.16
Name: 730293, dtype: float64
You can confirm this here:
In [8]:
topseller.index
Out[8]:
Int64Index([423902, 192042, 720933, 730293, 200182], dtype='int64', name='sku')
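If you would rather keep sku addressable as a column (and note that DataFrame.sort() was removed in later pandas versions in favor of sort_values()), here is a minimal sketch, assuming dt is the frame from the question:
topseller = (dt.groupby("sku", as_index=False)["summen_netto"].sum()
               .sort_values("summen_netto", ascending=False))
# label-style lookups then work on the column, e.g.
topseller.loc[topseller["sku"] == 730293, "summen_netto"]
# or set sku back as the index: topseller.set_index("sku").loc[730293, "summen_netto"]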

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from users first transaction

I have a dataset like this:
Customer ID  Date        Profit
1            4/13/2018   10.00
1            4/26/2018   13.27
1            10/23/2018  15.00
2            1/1/2017    7.39
2            7/5/2017    9.99
2            7/7/2017    10.01
3            5/4/2019    30.30
I'd like to group by and sum profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID  Date        Profit
1            4/13/2018   23.27
1            10/13/2018  15.00
2            1/1/2017    7.39
2            7/1/2017    20.00
3            5/4/2019    30.30
The closest I seem to have gotten on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to sum starting on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date would be 7/1/2017 rather than 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month, until you find a more complete solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
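If you do need the windows anchored at each customer's actual first transaction, here is a hedged sketch (not from the original answer; it buckets rows by whole calendar months elapsed since the first purchase, which approximates exact 6-month windows):
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 2, 3],
    "Date": ["4/13/2018", "4/26/2018", "10/23/2018",
             "1/1/2017", "7/5/2017", "7/7/2017", "5/4/2019"],
    "Profit": [10.00, 13.27, 15.00, 7.39, 9.99, 10.01, 30.30],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# each customer's first purchase date, broadcast back onto every row
first = df.groupby("Customer ID")["Date"].transform("min")
# whole calendar months since the first purchase, bucketed into 6-month windows
months = (df["Date"].dt.year - first.dt.year) * 12 + (df["Date"].dt.month - first.dt.month)
window = (months // 6).rename("window")

out = (df.groupby(["Customer ID", window])
         .agg(Date=("Date", "min"), Profit=("Profit", "sum"))
         .reset_index()
         .drop(columns="window"))
print(out)
The Date reported for each window is the first purchase inside it (e.g. 10/23/2018 rather than 10/13/2018), so the totals match the desired output even though the dates are not exactly the window starts.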

How to represent each user by a unique row (Python)?

I have data like this:
UserId Date Part_of_day Apps Category Frequency Duration_ToT
1 2020-09-10 evening Settings System tool 1 3.436
1 2020-09-11 afternoon Calendar Calendar 5 9.965
1 2020-09-11 afternoon Contacts Phone_and_SMS 7 2.606
2 2020-09-11 afternoon Facebook Social 15 50.799
2 2020-09-11 afternoon clock System tool 2 5.223
3 2020-11-18 morning Contacts Phone_and_SMS 3 1.726
3 2020-11-18 morning Google Productivity 1 4.147
3 2020-11-18 morning Instagram Social 1 0.501
.......................................
67 2020-11-18 morning Truecaller Communication 1 1.246
67 2020-11-18 night Instagram Social 3 58.02
I'm trying to reduce the dimensionality of my dataframe to prepare the entries for k-means.
I'd like to ask whether it's possible to represent each user by one row? What do you think about using embeddings?
How can I do this, please? I can't find any solution.
This depends on how you want to aggregate the values. Here is a small example of how to do it with groupby and agg.
First I create some sample data.
import pandas as pd
import random

df = pd.DataFrame({
    "id": [int(i/3) for i in range(20)],
    "val1": [random.random() for _ in range(20)],
    "val2": [str(int(random.random()*100)) for _ in range(20)]
})
>>> df.head()
id val1 val2
0 0 0.174553 49
1 0 0.724547 95
2 0 0.369883 3
3 1 0.243191 64
4 1 0.575982 16
>>> df.dtypes
id int64
val1 float64
val2 object
dtype: object
Then we group by the id and aggregate the values according to the functions you specify in the dictionary you pass to agg. In this example I sum up the float values and join the strings with an underscore separator. You could e.g. also pass the list function to store the values in a list.
>>> df.groupby("id").agg({"val1": sum, "val2": "__".join})
val1 val2
id
0 1.268984 49__95__3
1 0.856992 64__16__54
2 2.186370 30__59__21
3 1.486925 29__47__77
4 1.523898 19__78__99
5 0.855413 59__74__73
6 0.201787 63__33
EDIT regarding the comment "But how can we make val2 contain the top 5 applications according to the duration of the application?":
The agg method is restricted in the sense that you cannot access other attributes while aggregating. To do that you should use the apply method. You pass it a function that processes the whole group and returns a row as a Series object.
In this example I still use the sum for val1, but for val2 I return the val2 of the row with the highest val1. This should make clear how to make the aggregation depend on other attributes.
def apply_func(group):
    return pd.Series({
        "id": group["id"].iat[0],
        "val1": group["val1"].sum(),
        "val2": group["val2"].iat[group["val1"].argmax()]
    })
>>> df.groupby("id").apply(apply_func)
id val1 val2
id
0 0 1.749955 95
1 1 0.344372 65
2 2 2.019035 70
3 3 2.444691 36
4 4 2.573576 92
5 5 1.453769 72
6 6 1.811516 94
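To tie this back to the columns in the question, here is a hedged sketch (the column names UserId, Apps and Duration_ToT are taken from the sample data above) that keeps, per user, the total duration and the five apps with the longest summed duration:
def top5_apps(group):
    # sum the duration per app within this user's rows, then keep the 5 largest
    top = (group.groupby("Apps")["Duration_ToT"].sum()
                .nlargest(5).index.tolist())
    return pd.Series({
        "total_duration": group["Duration_ToT"].sum(),
        "top_apps": "__".join(top),
    })

per_user = df.groupby("UserId").apply(top5_apps).reset_index()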

How do I convert two DataFrame columns into summed Series?

I have a pandas DataFrame that looks like this:
date sku qty
0 2015-10-30 ABC 1
1 2015-10-30 DEF 1
2 2015-10-30 ABC 2
3 2015-10-31 DEF 1
4 2015-10-31 ABC 1
... ... ... ...
How can I extract all of the data for a particular sku and sum up the qty by date? For example, for the ABC SKU:
2015-10-30 3
2015-10-31 1
... ...
The closest I've gotten is a hierarchical grouping with sales.groupby(['date', 'sku']).sum().
If you will work with all (or several) SKUs, then:
agg_df = df.groupby(['sku','date']).qty.sum()
# extract some sku data
agg_df.loc['ABC']
Output:
date
2015-10-30 3
2015-10-31 1
Name: qty, dtype: int64
If you only care about ABC specifically, then it's better to filter for it first:
df[df['sku'].eq('ABC')].groupby('date')['qty'].sum()
The output would be the same as above.
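For a quick check, here is a minimal reconstruction of the sample frame (assumed literal values) that both snippets run against:
import pandas as pd

df = pd.DataFrame({
    "date": ["2015-10-30", "2015-10-30", "2015-10-30", "2015-10-31", "2015-10-31"],
    "sku": ["ABC", "DEF", "ABC", "DEF", "ABC"],
    "qty": [1, 1, 2, 1, 1],
})

print(df.groupby(["sku", "date"]).qty.sum().loc["ABC"])      # aggregate all SKUs, then select ABC
print(df[df["sku"].eq("ABC")].groupby("date")["qty"].sum())   # filter ABC first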

Python selecting row from second dataframe based on complex criteria

I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.
df1:
purchased_at product_id cost
01-01-2017 1 £10
01-01-2017 2 £8
09-01-2017 1 £10
18-01-2017 3 £12
df2:
week_no week_start week_end
1 31-12-2016 06-01-2017
2 07-01-2017 13-01-2017
3 14-01-2017 20-01-2017
I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.
df1:
purchased_at product_id cost week_no
01-01-2017 1 £10 1
01-01-2017 2 £8 1
09-01-2017 1 £10 2
18-01-2017 3 £12 3
I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.
df1.loc[(df1['purchased_at'] < df2['week_end']) &
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]
was unsuccessful, with the ValueError 'can only compare identically-labeled Series objects'
Could anyone help with this problem? I'm also open to suggestions if there is a better way to achieve the same outcome.
edit to add further detail of df1
df1 full dataframe headers
purchased_at purchase_id product_id product_name transaction_id account_number cost
01-01-2017 1 1 A 1 AA001 £10
01-01-2017 2 2 B 1 AA001 £8
02-01-2017 3 1 A 2 AA008 £10
03-01-2017 4 3 C 3 AB040 £12
...
09-01-2017 12 1 A 10 AB102 £10
09-01-2017 13 2 B 11 AB102 £8
...
18-01-2017 20 3 C 15 AA001 £12
So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.
If your dataframes are not too big, you can use this trick.
Do a full Cartesian product join of all records to all records:
df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')
Next, filter out the records that do not match the criteria; in this case, where purchased_at is not between week_start and week_end:
(df_out.query('week_start < purchased_at < week_end')
       .drop(['key', 'week_start', 'week_end'], axis=1))
Output:
purchased_at product_id cost week_no
0 2017-01-01 1 £10 1
3 2017-01-01 2 £8 1
7 2017-01-09 1 £10 2
11 2017-01-18 3 £12 3
If you do have large dataframes then you can use this numpy method as proposed by PiRSquared.
a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start', 'week_end'], axis=1)
Output:
purchased_at product_id cost week_no
0 2017-01-01 00:00:00 1 £10 1
1 2017-01-01 00:00:00 2 £8 1
2 2017-01-09 00:00:00 1 £10 2
3 2017-01-18 00:00:00 3 £12 3
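For reference, here is a minimal construction of df1 and df2 (assumed literal values from the sample tables, with the day-first dates parsed to datetimes so the comparisons work) that both snippets above run against:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "purchased_at": pd.to_datetime(
        ["01-01-2017", "01-01-2017", "09-01-2017", "18-01-2017"], dayfirst=True),
    "product_id": [1, 2, 1, 3],
    "cost": ["£10", "£8", "£10", "£12"],
})
df2 = pd.DataFrame({
    "week_no": [1, 2, 3],
    "week_start": pd.to_datetime(["31-12-2016", "07-01-2017", "14-01-2017"], dayfirst=True),
    "week_end": pd.to_datetime(["06-01-2017", "13-01-2017", "20-01-2017"], dayfirst=True),
})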
You could just use strftime() to extract the week number from the date. If you want to keep counting the weeks upwards, you need to define a "zero year" as the start of your time series and offset the week_no accordingly:
import pandas as pd
data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], 'product_id': [1,2,1,3], 'cost':['£10', '£8', '£10', '£12']}
df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])
def getWeekNo(date, year0):
    datetime = pd.to_datetime(date, dayfirst=True)
    year = int(datetime.strftime('%Y'))
    weekNo = int(datetime.strftime('%U'))
    return weekNo + 52*(year - year0)
df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))
Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
This way, you don't need to maintain a separate DataFrame only for week numbers.
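A further hedged option, not from the answers above: if df1 and df2 are prepared as in the setup sketch earlier, an IntervalIndex built from df2's week boundaries lets you look each purchase date up directly (this assumes the weeks do not overlap; unmatched dates would come back as position -1):
intervals = pd.IntervalIndex.from_arrays(df2["week_start"], df2["week_end"], closed="both")
df1["week_no"] = df2["week_no"].values[intervals.get_indexer(df1["purchased_at"])]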

Add values in column of Panda Dataframe

I want to add up the values for a particular column.
I have a dataframe loaded from CSV that contains the following data:
Date Item Count Price per Unit Sales
0 1/21/16 Unit A 40 $1.50 $60.00
1 1/22/16 Unit A 20 $1.50 $30.00
2 1/23/16 Unit A 100 $1.50 $150.00
I want to add up all the sales. I've tried:
print sales_df.groupby(["Sales"]).sum()
But it's not adding up the sales. What can I do to make this work?
IIUC you need to sum values from your Sales column. First you need to remove $ with str.replace and then convert to numeric with pd.to_numeric. Then you could use sum. One liner:
pd.to_numeric(df.Sales.str.replace("$", "")).sum()
And step by step:
In [35]: df.Sales
Out[35]:
0 $60.00
1 $30.00
2 $150.00
Name: Sales, dtype: object
In [36]: df.Sales.str.replace("$", "")
Out[36]:
0 60.00
1 30.00
2 150.00
Name: Sales, dtype: object
In [37]: pd.to_numeric(df.Sales.str.replace("$", ""))
Out[37]:
0 60
1 30
2 150
Name: Sales, dtype: float64
In [38]: pd.to_numeric(df.Sales.str.replace("$", "")).sum()
Out[38]: 240.0
Note: pd.to_numeric works only with pandas version >= 0.17.0. If you are using an older version, take a look at convert_objects(convert_numeric=True).
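On a newer pandas, Series.str.replace may treat the pattern as a regular expression depending on the version; passing regex=False makes the literal "$" removal explicit. A hedged one-line variant:
total = pd.to_numeric(df.Sales.str.replace("$", "", regex=False)).sum()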
