I am a Pandas newbie and I am trying to automate the processing of ticket data we get from our IT ticketing system. After experimenting I was able to get 80 percent of the way to the result I am looking for.
Currently I pull in the ticket data from a CSV into a "df" dataframe. I then want to summarize the data for the higher ups to review and get high level info like totals and average "age" of tickets (number of days between ticket creation date and current date).
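For reference, I compute the "age" column along these lines (a sketch; assumes created_on holds parseable timestamps):

df['age'] = (pd.Timestamp.now().normalize() - pd.to_datetime(df['created_on'])).dt.days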
Here's an example of the ticket data for the "df" dataframe (the raw CSV text for the sample data is included at the bottom of this post).
I then create a "df2" dataframe to summarize df using:
df2 = df.groupby(["dept", "group", "assignee", "ticket_type"]).agg(
    task_count=('ticket_type', 'size'),
    mean_age_in_days=('age', 'mean'),
)
And here's what I am getting if I print out df2, which is very close to what I need.
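With the sample data at the bottom of this post, df2 prints like this:

                                                              task_count  mean_age_in_days
dept                 group       assignee         ticket_type
Data_Division        DataGroup   Bob Simpson      Incident             1              74.0
                                                  Request              1             119.0
IT_Services_Division Helpdesk    Jane Doe         Incident             2             240.5
                                 Tilly James      Request              1             120.0
                     Systems     Linda Nguyen     Request              1             188.0
                                 Steve Lee        Incident             1             252.0
                                 Veronica Paulson Incident             1              98.0
                     TechSupport Bill Gonzales    Request              2              33.5
                                 Joseph Wills     Incident             1             181.0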
As you can see, we look at the count of tickets assigned to each staff member, separated by type (Incident, Request), and also at the average "age" of each ticket type for each staff member.
The roadblock I am hitting now, and have been pulling my hair out over, is that I need to show the aggregates (counts and averages of ages) at all 3 levels (sorry if I am using the wrong jargon): the count and average age for all tickets associated with a group, then the same thing at the department ("Division") level, and lastly the grand total and grand average for all tickets in the entire organization (all tickets in all departments and groups).
Here's the ideal result I am trying to get: for each group, a row with the count of tickets and average age for all tickets in that group; then the count and average age at the dept/division level (all tickets for all groups belonging to a given dept./division); and lastly, the grand total and grand average for all tickets in the entire organization. In the end, both df2 (the summary of the ticket data) and df will be dumped to an Excel file on separate worksheets in the same workbook.
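For that last step I'm planning something like this (a sketch; the file and sheet names are placeholders, and an Excel engine such as openpyxl must be installed):

with pd.ExcelWriter('ticket_summary.xlsx') as writer:
    df2.to_excel(writer, sheet_name='Summary')   # summary worksheet
    df.to_excel(writer, sheet_name='Tickets')    # raw ticket data worksheet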
Please have mercy on me! Can someone show me how I could generate the desired "summary" with counts and average age at all levels (group, dept., and organization)? Thanks in advance for any assistance, I'd really, really appreciate it!
Here's the raw CSV text for the sample ticket data:
,number,created_on,dept,group,assignee,ticket_type,age
0,14500,2021-02-19 11:48:28,IT_Services_Division,Helpdesk,Jane Doe,Incident,361
1,16890,2021-04-20 10:51:49,IT_Services_Division,Helpdesk,Jane Doe,Incident,120
2,16891,2021-04-20 11:51:00,IT_Services_Division,Helpdesk,Tilly James,Request,120
3,15700,2021-06-09 09:05:28,IT_Services_Division,Systems,Steve Lee,Incident,252
4,16000,2021-08-12 09:32:39,IT_Services_Division,Systems,Linda Nguyen,Request,188
5,16100,2021-08-18 17:43:54,IT_Services_Division,TechSupport,Joseph Wills,Incident,181
6,19000,2021-01-17 15:01:50,IT_Services_Division,TechSupport,Bill Gonzales,Request,30
7,18990,2021-01-10 13:00:01,IT_Services_Division,TechSupport,Bill Gonzales,Request,37
8,18800,2021-12-03 21:13:12,Data_Division,DataGroup,Bob Simpson,Incident,74
9,16880,2021-10-18 11:56:03,Data_Division,DataGroup,Bob Simpson,Request,119
10,18000,2021-11-09 14:28:44,IT_Services_Division,Systems,Veronica Paulson,Incident,98
Here's a different approach which is easier, but results in a different structure:
agg_df = df.copy()
#Add dept-level info to the dept label
gb = agg_df.groupby('dept')
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['dept'] += ' ['+task_counts+' tasks, avg age= '+mean_ages+']'
#Add group-level info to the group label
gb = agg_df.groupby(['dept','group'])
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['group'] += ' ['+task_counts+' tasks, avg age= '+mean_ages+']'
#Add org-level info
agg_df['org'] = 'Org [{} tasks, avg age = {}]'.format(len(agg_df),agg_df['age'].mean().round(2))
agg_df = (
    agg_df.groupby(['org','dept','group','assignee','ticket_type']).agg(
        task_count=('ticket_type','count'),
        mean_ticket_age=('age','mean'))
)
agg_df
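For reference, on the sample data the embedded labels come out like this (values derived from the CSV in the question; means rounded to two decimals by the .round(2) calls above):

Org [11 tasks, avg age = 143.64]
IT_Services_Division [9 tasks, avg age= 154.11]
Helpdesk [3 tasks, avg age= 200.33]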
Couldn't think of a cleaner way to get the structure you want, so I had to manually loop through the different groupby levels, adding one row at a time:
multi_ind = pd.MultiIndex.from_tuples([], names=('dept','group','assignee','ticket_type'))
agg_df = pd.DataFrame(index=multi_ind, columns=['task_count','mean_age_in_days'])

data = lambda df: {'task_count':len(df), 'mean_age_in_days':df['age'].mean()}

for dept, dept_g in df.groupby('dept'):
    for group, group_g in dept_g.groupby('group'):
        for assignee, assignee_g in group_g.groupby('assignee'):
            for ticket_type, ticket_g in assignee_g.groupby('ticket_type'):
                #Add ticket totals
                agg_df.loc[(dept,group,assignee,ticket_type)] = data(ticket_g)
        #Add group totals
        agg_df.loc[(dept,group,assignee,'Group Total/Avg')] = data(group_g)
    #Add dept totals
    agg_df.loc[(dept,group,assignee,'Dept Total/Avg')] = data(dept_g)

#Add org totals
agg_df.loc[('','','','Org Total/Avg')] = data(df)

agg_df
Output
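(Values below computed by running the loop on the sample CSV above; means shown rounded to two decimals.)

                                                              task_count  mean_age_in_days
dept                 group       assignee         ticket_type
Data_Division        DataGroup   Bob Simpson      Incident             1              74.0
                                                  Request              1             119.0
                                              Group Total/Avg         2              96.5
                                               Dept Total/Avg         2              96.5
IT_Services_Division Helpdesk    Jane Doe         Incident             2             240.5
                                 Tilly James      Request              1             120.0
                                              Group Total/Avg         3            200.33
                     Systems     Linda Nguyen     Request              1             188.0
                                 Steve Lee        Incident             1             252.0
                                 Veronica Paulson Incident             1              98.0
                                              Group Total/Avg         3            179.33
                     TechSupport Bill Gonzales    Request              2              33.5
                                 Joseph Wills     Incident             1             181.0
                                              Group Total/Avg         3             82.67
                                               Dept Total/Avg         9            154.11
                                                Org Total/Avg        11            143.64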
I started an actuarial project in Python, based on the idea shown in this model: https://github.com/Saurabh0503/Financial-modelling-and-valuationn/blob/main/Dynamic%20Salary%20Retirement%20Model%20Internal%20Randomness.ipynb. It simulates, from a set of inputs plus some degree of internal randomness, how long it will take for an individual to retire, given a certain amount of wealth, a certain annual salary, and a certain annual payment (calculated as the desired cash divided by the years it will take to retire). In my variation of the model, the user can define his/her own parameters, making the model more flexible and user friendly, and there is a function that calculates the desired retirement cash based on the individual's propensity both to save and to spend.
The problem is that since I want to summarize the output I obtain from the model (by taking the mean, max, min and std. deviation of wealth, salary and years to retirement), I have to save the results and recall them when needed, but I have no idea how to accomplish this.
I tried this solution, consisting of saving the simulation's output in a pandas dataframe. In particular, I wrote this function:
def get_salary_wealth_year_case_df(data):
    all_ytrs = []
    salary = []
    wealth = []
    annual_payments = []
    # year, case and prior_wealth are defined elsewhere in the model
    for i in range(data.n_iter):
        ytr = years_to_retirement(data, print_output=False)
        sal = salary_at_year(data, year, case, print_output=False)
        wlt = wealth_at_year(data, year, prior_wealth, case, print_output=False)
        pmt = annual_pmts_case_df(wealth_at_year, year, case, print_output=False)
        all_ytrs.append(ytr)
        salary.append(sal)
        wealth.append(wlt)
        annual_payments.append(pmt)
    df = pd.DataFrame()
    df['Years to Retirement'] = all_ytrs
    df['Salary'] = salary
    df['Wealth'] = wealth
    df['Annual Payments'] = annual_payments
    return df
I need feedback on what I'm doing. Am I doing it right? If so, are there more efficient ways to do so? If not, what should I do? Thanks in advance!
Given the inputs used for the function, I'm assuming your code (as it is) will do just fine in terms of computation speed.
As suggested, you can add a saving option to your function so that the results being returned are also stored in a .csv file:
def get_salary_wealth_year_case_df(data, path):
    all_ytrs = []
    salary = []
    wealth = []
    annual_payments = []
    # year, case and prior_wealth are defined elsewhere in the model
    for i in range(data.n_iter):
        ytr = years_to_retirement(data, print_output=False)
        sal = salary_at_year(data, year, case, print_output=False)
        wlt = wealth_at_year(data, year, prior_wealth, case, print_output=False)
        pmt = annual_pmts_case_df(wealth_at_year, year, case, print_output=False)
        all_ytrs.append(ytr)
        salary.append(sal)
        wealth.append(wlt)
        annual_payments.append(pmt)
    df = pd.DataFrame()
    df['Years to Retirement'] = all_ytrs
    df['Salary'] = salary
    df['Wealth'] = wealth
    df['Annual Payments'] = annual_payments
    # Save the dataframe to a given path inside your workspace
    df.to_csv(path, header=False)
    return df
After saving, returning the object might be optional, depending on whether you are going to use this dataframe in your code moving forward.
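If you later want to recall the results for summarizing, you can read the file back; a sketch (the 'run' index name is a placeholder, and the column names must be passed explicitly because the file was written with header=False):

cols = ['Years to Retirement', 'Salary', 'Wealth', 'Annual Payments']
results = pd.read_csv(path, names=['run'] + cols, index_col='run')
# mean, max, min and standard deviation of each quantity
print(results.agg(['mean', 'max', 'min', 'std']))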
I am practicing data analytics and I am stuck on one problem.
TRAINING DATAFRAME
I group the dataframe by DATE PURCHASED and take the unique values, because I want to count the unique values for each purchase date.
training.groupby('DATE PURCHASED')['Account - Store Name'].unique().to_frame()
So it looks like this:
GROUPBY DATE PURCHASED
Now that the data has been aggregated, I want to count the items in that column, so I used .split(',').
training_groupby['Account - Store Name'].apply(lambda x: x.split(','))
but I got error:
AttributeError: 'numpy.ndarray' object has no attribute 'split'
Can someone help me with how to count the number of unique values per DATE PURCHASED? I've been trying to solve this for almost a week now. I searched on YouTube and Google, but I can't find anything that helps.
I think this is what you want?
training_groupby["Total Purchased"] = training_groupby["Account - Store Name"].apply(lambda x: len(set(x)))
You can do multiple aggregations in the same pandas.DataFrame.groupby clause.
Try this:
out = (training
       .groupby(['DATE PURCHASED'])
       .agg(**{
           'Account - Store Name': ('Account - Store Name', 'unique'),
           'Items Count': ('Account - Store Name', 'nunique'),
       })
      )
# Output:
print(out)

                                   Account - Store Name  Items Count
DATE PURCHASED
13/01/2022           [Landmark Makati, Landmark Nuvali]            2
14/01/2022                            [Landmark Nuvali]            1
15/01/2022         [Robinsons Dolores, Landmark Nuvali]            2
16/01/2022  [Robinsons Ilocos Norte, Landmarj Trinoma]             2
19/01/2022                           [Shopwise Alabang]            1
I have 2 dataframes:
df_hist: daily data of share values
df_buy_data: dates when shares were bought
I want to add the share holdings to df_hist for each date, calculated from df_buy_data depending on the date. In my version I have to iterate over the dataframe, which works, but I guess it's not so nice...
hist_data={'Date':['2022-01-01','2022-01-02','2022-01-03','2022-01-04'],'Value':[23,22,21,24]}
df_hist=pd.DataFrame(hist_data)
buy_data={'Date':['2022-01-01','2022-01-04'],'Ticker': ['Index1', 'Index1'], 'NumberOfShares':[15,29]}
df_buy_data = pd.DataFrame(buy_data)
for i, historical_row in df_hist.iterrows():
    ticker_count = df_buy_data.loc[(df_buy_data['Date'] <= historical_row['Date'])]\
        .groupby('Ticker').sum()['NumberOfShares']
    if len(ticker_count) > 0:
        df_hist.at[i, 'Index1_NumberOfShares'] = ticker_count.item()
    else:
        df_hist.at[i, 'Index1_NumberOfShares'] = 0
df_hist
How can I improve this?
Thanks for the help!
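Edit: one direction I've been looking at, in case it helps (a sketch using pd.merge_asof, which matches each history row to the latest buy on or before its date; same column names as my example above, and it assumes a single ticker, like the loop):

# convert to real dates so merge_asof can align on them
df_hist['Date'] = pd.to_datetime(df_hist['Date'])
df_buy_data['Date'] = pd.to_datetime(df_buy_data['Date'])
# running total of shares held after each buy
cum_buys = df_buy_data.sort_values('Date').copy()
cum_buys['Index1_NumberOfShares'] = cum_buys['NumberOfShares'].cumsum()
# for each daily row, take the most recent running total (NaN before the first buy)
df_hist = pd.merge_asof(df_hist.sort_values('Date'),
                        cum_buys[['Date', 'Index1_NumberOfShares']],
                        on='Date')
df_hist['Index1_NumberOfShares'] = df_hist['Index1_NumberOfShares'].fillna(0)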
I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row include the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates if that row contains the writer or performer of that song.
The first column of each row contains a song title.
This structure makes the data confusing: on the rows that list the % share, the NAME column is blank because those rows have no writer/performer associated with them.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd
data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE, group in title_grouped:
    print(TITLE)
    print(group)
I was able to groupby('TITLE') of each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
Decompose the data by the ROLE_TYPE
Prepare the data for merge (rename columns and drop unnecessary columns)
Merge everything back into one DataFrame
Merge will be automatically performed over the column which has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
data = pd.read_csv("data2.csv", sep=",")
# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()
# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")
# Print result
print(result)
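If your real extract also guarantees at most one writer and one performer per title, pivoting ROLE_TYPE into columns is a possible alternative to the merges; a sketch, using the column names from your sample (the NOTE column could be carried along the same way as SHARES):

# Pivot ROLE_TYPE into columns, keeping the first NAME seen per title/role
wide = data.pivot_table(index='TITLE', columns='ROLE_TYPE',
                        values='NAME', aggfunc='first')
# SHARES only appears on the ASCAP row, so collect it per title
shares = data.groupby('TITLE')['SHARES'].max()
result = (wide.drop(columns='ASCAP')
              .rename(columns={'W': 'WRITER', 'P': 'PERFORMER'})
              .join(shares)
              .reset_index())
print(result)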
I want to calculate a group specific ratio gathered from two datasets.
The two Dataframes are read from a database with
leases = pd.read_sql_query(sql, connection)
sales = pd.read_sql_query(sql, connection)
one for real estate offered for sale, the other for rented objects.
Then I group both of them by their city and the category I'm interested in:
leasegroups = leases.groupby(['IDconjugate', "city"])
salegroups = sales.groupby(['IDconjugate', "city"])
Now I want to know the ratio between the cheapest rental object per category and city and the most expensively sold object to obtain a lower bound for possible return:
minlease = leasegroups['price'].min()
maxsale = salegroups['price'].max()
ratios = minlease*12/maxsale
I get an output like: Category - City: Ratio
But I cannot access the ratio object by city nor category. I tried creating a new dataframe with:
newframe = pd.DataFrame({"Minleases" : minlease,"Maxsales" : maxsale,"Ratios" : ratios})
newframe = newframe.loc[newframe['Ratios'].notnull()]
which gives me the correct rows, and newframe.index returns the groups.
index.names gives ['IDconjugate', 'city'], but indexing results in a KeyError. How can I make an index out of the different groups: ID0+city1, ID0+city2, etc.?
EDIT:
The output looks like this:
                               Maxsales  Minleases    Ratios
IDconjugate city
1           argeles gazost        59500        337  0.067966
            chelles              129000        519  0.048279
            enghien-les-bains    143000        696  0.058406
            esbly                117990        495  0.050343
            foix                  58000        350  0.072414
The goal was to select the top ratios and plot them with bokeh, which takes a dataframe object and plots a column versus an index, as I understand it:
topselect = ratio.loc[ratio["Ratios"] > ratio["Ratios"].quantile(quant)]
dots = Dot(topselect, values='Ratios', label=topselect.index, tools=[hover,],
           title="{}% best minimal Lease/Sale Ratios per City and Group".format(topperc*100),
           width=600)
I really only needed the index as a list in the original order, so the following worked:
ids = []
cities = []
for l in topselect.index:
    ids.append(str(int(l[0])))
    cities.append(l[1])
newind = [i+"_"+j for i,j in zip(ids, cities)]
topselect.index = newind
Now the plot shows 1_city1 ... 1_city2 ... n_cityX on the x-axis. But I figure there must be some obvious way inside the pandas framework that I'm missing.
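Edit: two pandas-native alternatives to the loop above, in case they help someone (a sketch, using the same names as before):

# rows of a MultiIndexed frame are selected with tuples, not plain labels
print(newframe.loc[(1, 'argeles gazost')])
# and the flattened labels can be built directly from the index:
topselect.index = ['{}_{}'.format(int(i), c) for i, c in topselect.index]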