How to count the unique values per date using Python

I am practicing data analytics and I am stuck on one problem.
(screenshot of the training dataframe)
I grouped the dataframe by DATE PURCHASED and used unique() because I want to count the unique values for each purchase date:
training.groupby('DATE PURCHASED')['Account - Store Name'].unique().to_frame()
So it looks like this:
(screenshot of the grouped result)
Now that the data has been aggregated, I want to count the items in that column, so I used .split(','):
training_groupby['Account - Store Name'].apply(lambda x: x.split(','))
but I got this error:
AttributeError: 'numpy.ndarray' object has no attribute 'split'
Can someone help me count the number of unique values per DATE PURCHASED? I've been trying to solve this for almost a week now; I searched YouTube and Google but couldn't find anything that helped.

I think this is what you want?
training_groupby["Total Purchased"] = training_groupby["Account - Store Name"].apply(lambda x: len(set(x)))
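A minimal sketch of this on made-up rows (column names taken from the question); each cell of the grouped column is a numpy array, so len(set(x)) counts its distinct entries:

```python
import pandas as pd

# Hypothetical rows shaped like the question's training dataframe
training = pd.DataFrame({
    'DATE PURCHASED': ['13/01/2022', '13/01/2022', '14/01/2022'],
    'Account - Store Name': ['Landmark Makati', 'Landmark Nuvali', 'Landmark Nuvali'],
})

# Reproduce the grouped frame from the question, then count per row
training_groupby = training.groupby('DATE PURCHASED')['Account - Store Name'].unique().to_frame()
training_groupby['Total Purchased'] = training_groupby['Account - Store Name'].apply(lambda x: len(set(x)))
print(training_groupby['Total Purchased'].to_dict())
```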

You can do multiple aggregations in the same pandas.DataFrame.groupby clause.
Try this:
out = (training
       .groupby(['DATE PURCHASED'])
       .agg(**{
           'Account - Store Name': ('Account - Store Name', 'unique'),
           'Items Count': ('Account - Store Name', 'nunique'),
       })
)
# Output :
print(out)
Account - Store Name Items Count
DATE PURCHASED
13/01/2022 [Landmark Makati, Landmark Nuvali] 2
14/01/2022 [Landmark Nuvali] 1
15/01/2022 [Robinsons Dolores, Landmark Nuvali] 2
16/01/2022 [Robinsons Ilocos Norte, Landmarj Trinoma] 2
19/01/2022 [Shopwise Alabang] 1

Related

Python contains in a pandas DataFrame

On each iteration I group rows with the same price, add the quantities together, and concatenate the exchange names, like:
asks_price asks_qty exchange_name_ask bids_price bids_qty exchange_name_bid
0 20156.51 0.000745 Coinbase 20153.28 0.000200 Coinbase
1 20157.52 0.050000 Coinbase 20152.27 0.051000 Coinbase
2 20158.52 0.050745 CoinbaseFTX 20151.28 0.051200 KrakenCoinbase
but to build the orderbook I have to drop the rows from one provider each time before updating it, so I do:
self.global_orderbook = self.global_orderbook[
    self.global_orderbook.exchange_name_ask != name]
And then, with Coinbase for example, I still have:
asks_price asks_qty exchange_name_ask bids_price bids_qty exchange_name_bid
0 20158.52 0.050745 CoinbaseFTX 20151.28 0.051200 KrakenCoinbase
But I want combined names that contain Coinbase, like KrakenCoinbase, to be removed as well,
so I want to do something like:
self.global_orderbook = self.global_orderbook[name not in self.global_orderbook.exchange_name_ask]
It doesn't work.
I already tried with contains, but I can't call it on a Series:
self.global_orderbook = self.global_orderbook[self.global_orderbook.exchange_name_ask.contains(name)]
but:
'Series' object has no attribute 'contains'
Thanks for the help.
To do that we can use astype(str), like:
self.global_orderbook = self.global_orderbook[self.global_orderbook.exchange_name_ask.astype(str).str.contains(name, regex=False)]
And then it works: contains can be used on a string column through the .str accessor.
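For reference, a minimal sketch on made-up rows; note that str.contains keeps the matching rows, so to drop every row whose exchange name merely contains the provider you negate the mask with ~:

```python
import pandas as pd

# Hypothetical orderbook rows mirroring the question's columns
orderbook = pd.DataFrame({
    'asks_price': [20156.51, 20158.52, 20159.00],
    'exchange_name_ask': ['Coinbase', 'CoinbaseFTX', 'Kraken'],
})

name = 'Coinbase'
# ~mask drops both 'Coinbase' and 'CoinbaseFTX', keeping only 'Kraken'
kept = orderbook[~orderbook['exchange_name_ask'].str.contains(name, regex=False)]
print(kept['exchange_name_ask'].tolist())
```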

Dataframe insert sum of other dataframe

I have 2 dataframes:
df_hist: daily data of share values
df_buy_data: dates when shares were bought
I want to add the share holdings to df_hist for each date, calculated from df_buy_data depending on the date. In my version I have to iterate over the dataframe, which works but isn't very nice...
hist_data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'], 'Value': [23, 22, 21, 24]}
df_hist = pd.DataFrame(hist_data)
buy_data = {'Date': ['2022-01-01', '2022-01-04'], 'Ticker': ['Index1', 'Index1'], 'NumberOfShares': [15, 29]}
df_buy_data = pd.DataFrame(buy_data)

for i, historical_row in df_hist.iterrows():
    ticker_count = df_buy_data.loc[df_buy_data['Date'] <= historical_row['Date']]\
        .groupby('Ticker').sum()['NumberOfShares']
    if len(ticker_count) > 0:
        df_hist.at[i, 'Index1_NumberOfShares'] = ticker_count.item()
    else:
        df_hist.at[i, 'Index1_NumberOfShares'] = 0
df_hist
How can I improve this?
Thanks for the help!
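One way to avoid iterrows entirely, sketched under the assumption of a single ticker as in the sample data: take a running total of shares per ticker, then align it to the daily dates with merge_asof (which needs sorted datetime keys). This reproduces the loop's result for the sample data.

```python
import pandas as pd

df_hist = pd.DataFrame({'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
                        'Value': [23, 22, 21, 24]})
df_buy_data = pd.DataFrame({'Date': ['2022-01-01', '2022-01-04'],
                            'Ticker': ['Index1', 'Index1'],
                            'NumberOfShares': [15, 29]})

# merge_asof needs real datetimes, sorted ascending
df_hist['Date'] = pd.to_datetime(df_hist['Date'])
df_buy_data['Date'] = pd.to_datetime(df_buy_data['Date'])

# Running total of shares held after each purchase
cum = df_buy_data.assign(shares=df_buy_data.groupby('Ticker')['NumberOfShares'].cumsum())

# For each daily date, take the last purchase row on or before it (backward match)
df_hist['Index1_NumberOfShares'] = pd.merge_asof(
    df_hist.sort_values('Date'),
    cum[['Date', 'shares']].sort_values('Date'),
    on='Date')['shares'].fillna(0).to_numpy()
print(df_hist['Index1_NumberOfShares'].tolist())
```

With several tickers you would pass by='Ticker' to merge_asof and reshape afterwards.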

Pandas - groupby and show aggregate on all "levels"

I am a Pandas newbie and I am trying to automate the processing of ticket data we get from our IT ticketing system. After experimenting I was able to get 80 percent of the way to the result I am looking for.
Currently I pull in the ticket data from a CSV into a "df" dataframe. I then want to summarize the data for the higher ups to review and get high level info like totals and average "age" of tickets (number of days between ticket creation date and current date).
Here's an example of the ticket data for the "df" dataframe: (screenshot; the raw CSV is included below)
I then create a "df2" dataframe to summarize df using:
df2 = df.groupby(["dept", "group", "assignee", "ticket_type"]).agg(task_count=('ticket_type', 'size'), mean_age_in_days=('age', 'mean'))
And here's what I am getting if I print out df2, which is very close to what I need.
As you can see we look at the count of tickets assigned to each staff member, separated by type (incident, request), and also look at the average "age" of each ticket type (incident, request) for each staff member.
The roadblock that I am hitting now, and have been pulling my hair out about, is that I need to show the aggregates (count and average of ages) at all 3 levels (sorry if I am using the wrong jargon): the count and average age for all tickets in a group, the same for tickets at the department ("Division") level, and lastly the grand total and grand average for all tickets in the entire organization (all departments and groups).
Here's an example of the ideal result I am trying to get: (screenshot)
So I want the count of tickets and average age for a given group; then the count and average age for all tickets at the dept/division level (all groups belonging to a given dept./division); and lastly the grand total and grand average for the entire organization. In the end both df2 (the summary) and df will be dumped to an Excel file on separate worksheets in the same workbook.
Please have mercy on me! Can someone show me how I could generate the desired summary with counts and average age at all levels (group, dept., and organization)? Thanks in advance for any assistance, I'd really, really appreciate it!
*Added a link to a CSV with the sample ticket data (on GitHub).
Also, here's raw CSV text for the sample ticket data:
,number,created_on,dept,group,assignee,ticket_type,age
0,14500,2021-02-19 11:48:28,IT_Services_Division,Helpdesk,Jane Doe,Incident,361
1,16890,2021-04-20 10:51:49,IT_Services_Division,Helpdesk,Jane Doe,Incident,120
2,16891,2021-04-20 11:51:00,IT_Services_Division,Helpdesk,Tilly James,Request,120
3,15700,2021-06-09 09:05:28,IT_Services_Division,Systems,Steve Lee,Incident,252
4,16000,2021-08-12 09:32:39,IT_Services_Division,Systems,Linda Nguyen,Request,188
5,16100,2021-08-18 17:43:54,IT_Services_Division,TechSupport,Joseph Wills,Incident,181
6,19000,2021-01-17 15:01:50,IT_Services_Division,TechSupport,Bill Gonzales,Request,30
7,18990,2021-01-10 13:00:01,IT_Services_Division,TechSupport,Bill Gonzales,Request,37
8,18800,2021-12-03 21:13:12,Data_Division,DataGroup,Bob Simpson,Incident,74
9,16880,2021-10-18 11:56:03,Data_Division,DataGroup,Bob Simpson,Request,119
10,18000,2021-11-09 14:28:44,IT_Services_Division,Systems,Veronica Paulson,Incident,98
Here's a different approach which is easier, but results in a different structure:
agg_df = df.copy()

# Add dept-level info to the dept label
gb = agg_df.groupby('dept')
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['dept'] += ' [' + task_counts + ' tasks, avg age= ' + mean_ages + ']'

# Add group-level info to the group label
gb = agg_df.groupby(['dept', 'group'])
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['group'] += ' [' + task_counts + ' tasks, avg age= ' + mean_ages + ']'

# Add org-level info
agg_df['org'] = 'Org [{} tasks, avg age = {}]'.format(len(agg_df), agg_df['age'].mean().round(2))

agg_df = (
    agg_df.groupby(['org', 'dept', 'group', 'assignee', 'ticket_type']).agg(
        task_count=('ticket_type', 'count'),
        mean_ticket_age=('age', 'mean'))
)
agg_df
Couldn't think of a cleaner way to get the structure you want; I had to manually loop through the different groupby levels, adding one row at a time:
multi_ind = pd.MultiIndex.from_tuples([], names=('dept', 'group', 'assignee', 'ticket_type'))
agg_df = pd.DataFrame(index=multi_ind, columns=['task_count', 'mean_age_in_days'])
data = lambda df: {'task_count': len(df), 'mean_age_in_days': df['age'].mean()}

for dept, dept_g in df.groupby('dept'):
    for group, group_g in dept_g.groupby('group'):
        for assignee, assignee_g in group_g.groupby('assignee'):
            for ticket_type, ticket_g in assignee_g.groupby('ticket_type'):
                # Add ticket totals
                agg_df.loc[(dept, group, assignee, ticket_type)] = data(ticket_g)
        # Add group totals
        agg_df.loc[(dept, group, assignee, 'Group Total/Avg')] = data(group_g)
    # Add dept totals
    agg_df.loc[(dept, group, assignee, 'Dept Total/Avg')] = data(dept_g)

# Add org totals
agg_df.loc[('', '', '', 'Org Total/Avg')] = data(df)
agg_df
Output: (screenshot omitted)
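The underlying idea at each level is the same aggregation run over fewer keys. A toy sketch (made-up rows using the same dept/group/age columns as the sample CSV):

```python
import pandas as pd

# Made-up rows with the same dept/group/age columns as the sample CSV
df = pd.DataFrame({
    'dept':  ['IT_Services_Division'] * 3 + ['Data_Division'],
    'group': ['Helpdesk', 'Helpdesk', 'Systems', 'DataGroup'],
    'age':   [361, 120, 252, 74],
})

kwargs = dict(task_count=('age', 'count'), mean_age_in_days=('age', 'mean'))
per_group = df.groupby(['dept', 'group']).agg(**kwargs)  # group level
per_dept = df.groupby('dept').agg(**kwargs)              # dept level
org = {'task_count': len(df), 'mean_age_in_days': df['age'].mean()}  # org level
print(per_group, per_dept, org, sep='\n')
```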

How can I get the sum of one column based on year, which is stored in another column?

I have this code.
cheese_sums = []
for year in milk_products.groupby(milk_products['Date']):
    total = milk_products[milk_products['Date'] == year]['Cheddar Cheese Production (Thousand Tonnes)'].sum()
    cheese_sums.append(total)
print(cheese_sums)
I am trying to sum all the Cheddar Cheese Production, which are stored as floats in the milk_products data frame. The Date column is a datetime object that holds only the year, but has 12 values representing each month. As it's written now, I can only print a list of six 0.0's.
I got it. It should be:
cheese_sums = []
for year in milk_products['Date']:
    total = milk_products[milk_products['Date'] == year]['Cheddar Cheese Production (Thousand Tonnes)'].sum()
    if total not in cheese_sums:
        cheese_sums.append(total)
print(cheese_sums)
You're overcomplicating this.
Try groupby(...).sum():
df = milk_products.groupby('Date').sum()
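A toy sketch of that one-liner (made-up numbers; in the real data 'Date' is a datetime column holding the year, with 12 rows per year, but the grouping works the same way):

```python
import pandas as pd

col = 'Cheddar Cheese Production (Thousand Tonnes)'
milk_products = pd.DataFrame({
    'Date': [2020, 2020, 2021, 2021],   # stand-in for the year values
    col: [1.5, 2.0, 3.0, 1.0],
})

# One total per year, no loop needed
totals = milk_products.groupby('Date')[col].sum()
print(totals.to_dict())
```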

Pandas DataFrame: Adding a new column with the average price sold of an Author

I have this dataframe data with about 10,000 records of sold items for 201 authors.
I want to add a column to this dataframe which is the average price for each author.
First I create this new column average_price, and then I create another dataframe df
with the 201 authors and their average price (at least I think this is the right way to do this):
data["average_price"] = 0
df = data.groupby('Author Name', as_index=False)['price'].mean()
df looks like this
Author Name price
0 Agnes Cleve 107444.444444
1 Akseli Gallen-Kallela 32100.384615
2 Albert Edelfelt 207859.302326
3 Albert Johansson 30012.000000
4 Albin Amelin 44400.000000
... ... ...
196 Waldemar Lorentzon 152730.000000
197 Wilhelm von Gegerfelt 25808.510638
198 Yrjö Edelmann 53268.928571
199 Åke Göransson 87333.333333
200 Öyvind Fahlström 351345.454545
Now I want to use this df to populate the average_price column in the larger dataframe data.
I could not come up with how to do this, so I tried a for loop, which is not working (and I know you should avoid for loops when working with dataframes):
for index, row in data.iterrows():
    for ind, r in df.iterrows():
        if row["Author Name"] == r["Author Name"]:
            row["average_price"] = r["price"]
So i wonder how this should be done?
You can use transform and groupby to add a new column:
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
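A toy sketch of what transform does here (made-up rows): unlike agg, it returns one value per original row, so the result drops straight into the big dataframe:

```python
import pandas as pd

# Made-up stand-in for the 10,000-row dataframe
data = pd.DataFrame({
    'Author Name': ['Agnes Cleve', 'Agnes Cleve', 'Albin Amelin'],
    'price': [100000, 114889, 44400],
})

# One mean per author, broadcast back to every row of that author
data['average price'] = data.groupby('Author Name')['price'].transform('mean')
print(data['average price'].tolist())
```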
I think based on what you described, you should use the .join method on a pandas dataframe. You don't need to create the average_price column manually. For join to line up on author, the grouped result needs Author Name as its index:
avg_price = data.groupby('Author Name')['price'].mean().rename('average_price')
data = data.join(avg_price, on='Author Name')
Now you can get the average price from the data['average_price'] column.
Hope this helps!
I think the easiest way to do that would be a join (pandas.merge). Renaming the aggregated column first avoids a price_x/price_y suffix clash:
df_data = pd.DataFrame([...])  # your data here
df_agg_data = df_data.groupby('Author Name', as_index=False)['price'].mean().rename(columns={'price': 'average_price'})
df_data = df_data.merge(df_agg_data, on='Author Name')
print(df_data)
