Dataframe insert sum of other dataframe - python

I have two DataFrames:
df_hist: daily data of share values
df_buy_data: dates when shares were bought
I want to add the share holdings to df_hist for each date, calculated from df_buy_data depending on the date. My version iterates over the DataFrame, which works, but I guess it is not very nice:
import pandas as pd

hist_data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'], 'Value': [23, 22, 21, 24]}
df_hist = pd.DataFrame(hist_data)
buy_data = {'Date': ['2022-01-01', '2022-01-04'], 'Ticker': ['Index1', 'Index1'], 'NumberOfShares': [15, 29]}
df_buy_data = pd.DataFrame(buy_data)
for i, historical_row in df_hist.iterrows():
    # Sum all shares bought on or before this row's date
    ticker_count = df_buy_data.loc[df_buy_data['Date'] <= historical_row['Date']] \
        .groupby('Ticker')['NumberOfShares'].sum()
    if len(ticker_count) > 0:
        df_hist.at[i, 'Index1_NumberOfShares'] = ticker_count.item()
    else:
        df_hist.at[i, 'Index1_NumberOfShares'] = 0
df_hist
How can I improve this?
Thanks for the help!
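One possible way to avoid the row-by-row loop is sketched below. It is not taken from an answer to this question; it assumes the dates are ISO-formatted strings (so that sorting them as text is chronological) and uses the illustrative names bought, all_dates, holdings and result. The idea is to sum the purchases per date and ticker, take a cumulative sum over all dates, and join the running totals back onto df_hist.

```python
import pandas as pd

df_hist = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04"],
    "Value": [23, 22, 21, 24],
})
df_buy_data = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-04"],
    "Ticker": ["Index1", "Index1"],
    "NumberOfShares": [15, 29],
})

# Shares bought per date, one column per ticker.
bought = df_buy_data.pivot_table(
    index="Date", columns="Ticker", values="NumberOfShares",
    aggfunc="sum", fill_value=0,
)

# Cover purchase dates and historical dates alike, so that purchases made
# on a date missing from df_hist still count towards later holdings.
all_dates = bought.index.union(pd.Index(df_hist["Date"]))

# Running total of shares held, then keep only the historical dates.
holdings = (
    bought.reindex(all_dates, fill_value=0)
    .cumsum()
    .reindex(df_hist["Date"])
    .add_suffix("_NumberOfShares")
)

result = df_hist.join(holdings, on="Date")
print(result)
```

With several tickers in df_buy_data this should produce one <Ticker>_NumberOfShares column per ticker, whereas the loop above only works for a single ticker.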

Related

How to populate a dataframe from row-by-row calculations?

I am seeking to populate a pandas dataframe row-by-row, whereby each new row is calculated on the basis of the contents of the previous row. I am using this for simple financial projections.
Let us take a dataframe 'df_basic_financials':
df_basic_financials = pd.DataFrame({'current_account': [18357.], 'savings_account': [14809.]})
Now I want to forecast what my current and savings accounts will look like in five years, assuming that I earn 24000 a year and that my savings account yields 2% yearly, assuming I spend zero money and do not transfer any money to my savings account.
How do I write the code so that I get this:
   current_account  savings_account
0            18357       14809
1            42357       15105.18
2            66357       15407.2836
etc... for any number of years I want, each time using the calculation 'value of the previous row in the same column + 24000' for current_account and 'value of the previous row in the same column*1.02' for savings_account.
You can ask the user for the number of years and then run the code this way:
import pandas as pd
df = pd.DataFrame({'current_account': [18357], 'savings_account': [14809]})
years = int(input("Enter years: "))
for n in range(years):
    # Take the most recent row and derive the next one from it
    lastrow = df.iloc[-1]
    print(lastrow.iloc[0], lastrow.iloc[1])
    # No int() around the savings value, otherwise the cents are lost every year
    df.loc[len(df.index)] = [lastrow.iloc[0] + 24000, lastrow.iloc[1] * 1.02]
df
Out will be....
Just use math
df_basic_financials = pd.DataFrame({'current_account': [18357.], 'savings_account': [14809.]})
current_account_projection = [df_basic_financials['current_account'].iloc[-1] + (24000 * i) for i in range(10)]
savings_account_projection = [df_basic_financials['savings_account'].iloc[-1] * (1.02 ** i) for i in range(10)]
df_basic_financials = pd.DataFrame({'current_account': current_account_projection, 'savings_account': savings_account_projection})
If you really want an iterative solution, apply the calculation to the last row (DataFrame.append was removed in pandas 2.0, so this uses pd.concat):
current_account_next = df_basic_financials.iloc[-1]['current_account'] + 24000
savings_account_next = df_basic_financials.iloc[-1]['savings_account'] * 1.02
df_basic_financials = pd.concat([df_basic_financials, pd.DataFrame([{'current_account': current_account_next, 'savings_account': savings_account_next}])], ignore_index=True)
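If the row-by-row form is preferred, a minimal sketch is to collect the rows in a plain list and build the DataFrame once at the end, which avoids growing the DataFrame inside the loop. The function name project and its parameters income and rate are illustrative, not from the question.

```python
import pandas as pd

def project(start, years, income=24000, rate=0.02):
    """Return a DataFrame with the starting row plus one row per projected year."""
    rows = [dict(start)]
    for _ in range(years):
        prev = rows[-1]
        rows.append({
            "current_account": prev["current_account"] + income,
            "savings_account": prev["savings_account"] * (1 + rate),
        })
    return pd.DataFrame(rows)

projection = project(
    {"current_account": 18357.0, "savings_account": 14809.0}, years=5
)
print(projection)
```

Each new row depends only on the previous one, so any other per-year rule (for example a yearly transfer to savings) could be put inside the loop body.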

Pandas - groupby and show aggregate on all "levels"

I am a Pandas newbie and I am trying to automate the processing of ticket data we get from our IT ticketing system. After experimenting I was able to get 80 percent of the way to the result I am looking for.
Currently I pull in the ticket data from a CSV into a "df" dataframe. I then want to summarize the data for the higher ups to review and get high level info like totals and average "age" of tickets (number of days between ticket creation date and current date).
Here's an example of the ticket data for "df" dataframe:
I then create "df2" dataframe to summarize df using:
df2 = df.groupby(["dept", "group", "assignee", "ticket_type"]).agg(task_count=('ticket_type', 'size'), mean_age_in_days=('age', 'mean'),)
And here's what I am getting if I print out df2, which is very close to what I need.
As you can see we look at the count of tickets assigned to each staff member, separated by type (incident, request), and also look at the average "age" of each ticket type (incident, request) for each staff member.
The roadblock that I am hitting now and have been pulling my hair out about is I need to show the aggregates (count and averages of ages) at all 3 levels (sorry if I am using the wrong jargon). Basically I need to show the count and average age for all tickets associated with a group, then the same thing for tickets at the department ("Division") level, and lastly the grand total and grand average in green...for all tickets which is the entire organization (all tickets in all departments, groups).
Here's an example of the ideal result I am trying to get:
You will see in red I want the count of tickets and average age for tickets for a given group. Then, in blue I want the count and average age for all tickets on the dept/division level (all tickets for all groups belonging to a given dept./division). Lastly, I want the grand total and grand average for all tickets in the entire organization. In the end both the df2 (summary of ticket data) and df will be dumped to an Excel file on separate worksheets in the same workbook.
Please have mercy on me! Can someone show me how I could generate the desired "summary" with counts and average age at all levels (group, dept., and organization)? Thanks in advance for any assistance, I'd really, really appreciate it!
*Added link to CSV with sample ticket data below:
on Github
Also, here's raw CSV text for the sample ticket data:
,number,created_on,dept,group,assignee,ticket_type,age
0,14500,2021-02-19 11:48:28,IT_Services_Division,Helpdesk,Jane Doe,Incident,361
1,16890,2021-04-20 10:51:49,IT_Services_Division,Helpdesk,Jane Doe,Incident,120
2,16891,2021-04-20 11:51:00,IT_Services_Division,Helpdesk,Tilly James,Request,120
3,15700,2021-06-09 09:05:28,IT_Services_Division,Systems,Steve Lee,Incident,252
4,16000,2021-08-12 09:32:39,IT_Services_Division,Systems,Linda Nguyen,Request,188
5,16100,2021-08-18 17:43:54,IT_Services_Division,TechSupport,Joseph Wills,Incident,181
6,19000,2021-01-17 15:01:50,IT_Services_Division,TechSupport,Bill Gonzales,Request,30
7,18990,2021-01-10 13:00:01,IT_Services_Division,TechSupport,Bill Gonzales,Request,37
8,18800,2021-12-03 21:13:12,Data_Division,DataGroup,Bob Simpson,Incident,74
9,16880,2021-10-18 11:56:03,Data_Division,DataGroup,Bob Simpson,Request,119
10,18000,2021-11-09 14:28:44,IT_Services_Division,Systems,Veronica Paulson,Incident,98
Here's a different approach which is easier, but results in a different structure
agg_df = df.copy()
#Add dept-level info to the department
gb = agg_df.groupby('dept')
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['dept'] += ' ['+task_counts+' tasks, avg age= '+mean_ages+']'
#Add group-level info to the group label
gb = agg_df.groupby(['dept','group'])
task_counts = gb['ticket_type'].transform('count').astype(str)
mean_ages = gb['age'].transform('mean').round(2).astype(str)
agg_df['group'] += ' ['+task_counts+' tasks, avg age= '+mean_ages+']'
#Add org-level info
agg_df['org'] = 'Org [{} tasks, avg age = {}]'.format(len(agg_df),agg_df['age'].mean().round(2))
agg_df = (
    agg_df.groupby(['org','dept','group','assignee','ticket_type']).agg(
        task_count=('ticket_type','count'),
        mean_ticket_age=('age','mean'))
)
agg_df
Couldn't think of a cleaner way to get the structure you want and had to manually loop through the different groupby levels adding one row at a time
multi_ind = pd.MultiIndex.from_tuples([],names=('dept','group','assignee','ticket_type'))
agg_df = pd.DataFrame(index=multi_ind, columns=['task_count','mean_age_in_days'])
data = lambda df: {'task_count':len(df),'mean_age_in_days':df['age'].mean()}
for dept,dept_g in df.groupby('dept'):
    for group,group_g in dept_g.groupby('group'):
        for assignee,assignee_g in group_g.groupby('assignee'):
            for ticket_type,ticket_g in assignee_g.groupby('ticket_type'):
                #Add ticket totals
                agg_df.loc[(dept,group,assignee,ticket_type)] = data(ticket_g)
        #Add group totals
        agg_df.loc[(dept,group,assignee,'Group Total/Avg')] = data(group_g)
    #Add dept totals
    agg_df.loc[(dept,group,assignee,'Dept Total/Avg')] = data(dept_g)
#Add org totals
agg_df.loc[('','','','Org Total/Avg')] = data(df)
agg_df
Output
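As a further alternative (a sketch, not part of the answer above), the same aggregation can be run once per level and the results stacked with pd.concat. The helper name summarize and the label strings are illustrative, and only a few rows of the sample ticket data are used. Note that the totals end up below the detail rows rather than interleaved with them as in the desired layout.

```python
import pandas as pd

# A few rows taken from the sample ticket data in the question.
df = pd.DataFrame(
    [
        ("IT_Services_Division", "Helpdesk", "Jane Doe", "Incident", 361),
        ("IT_Services_Division", "Helpdesk", "Jane Doe", "Incident", 120),
        ("IT_Services_Division", "Helpdesk", "Tilly James", "Request", 120),
        ("IT_Services_Division", "Systems", "Steve Lee", "Incident", 252),
        ("Data_Division", "DataGroup", "Bob Simpson", "Incident", 74),
    ],
    columns=["dept", "group", "assignee", "ticket_type", "age"],
)

def summarize(frame, keys):
    """Ticket count and mean age for each combination of `keys`."""
    return (
        frame.groupby(keys)
        .agg(task_count=("age", "size"), mean_age_in_days=("age", "mean"))
        .reset_index()
    )

# One summary per level; the missing key columns are filled with labels.
detail = summarize(df, ["dept", "group", "assignee", "ticket_type"])
group_totals = summarize(df, ["dept", "group"]).assign(
    assignee="Group Total/Avg", ticket_type=""
)
dept_totals = summarize(df, ["dept"]).assign(
    group="Dept Total/Avg", assignee="", ticket_type=""
)
org_total = pd.DataFrame(
    [{"dept": "Org Total/Avg", "group": "", "assignee": "", "ticket_type": "",
      "task_count": len(df), "mean_age_in_days": df["age"].mean()}]
)

summary = pd.concat(
    [detail, group_totals, dept_totals, org_total], ignore_index=True
)
print(summary)
```

Since summary is an ordinary flat DataFrame, it can be written to its own worksheet with to_excel next to the raw ticket data.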

Pandas DataFrame: Adding a new column with the average price sold of an Author

I have this DataFrame data with about 10,000 records of sold items for 201 authors.
I want to add a column to this DataFrame which holds the average price for each author.
First I create this new column average_price and then I create another DataFrame df
with 201 rows of authors and their average price. (At least I think this is the right way to do this.)
data["average_price"] = 0
df = data.groupby('Author Name', as_index=False)['price'].mean()
df looks like this
Author Name price
0 Agnes Cleve 107444.444444
1 Akseli Gallen-Kallela 32100.384615
2 Albert Edelfelt 207859.302326
3 Albert Johansson 30012.000000
4 Albin Amelin 44400.000000
... ... ...
196 Waldemar Lorentzon 152730.000000
197 Wilhelm von Gegerfelt 25808.510638
198 Yrjö Edelmann 53268.928571
199 Åke Göransson 87333.333333
200 Öyvind Fahlström 351345.454545
Now I want to use this df to populate the average_price column in the larger DataFrame data.
I could not work out how to do this, so I tried a for loop, which is not working. (And I know you should avoid for loops when working with DataFrames.)
for index, row in data.iterrows():
    for ind, r in df.iterrows():
        if row["Author Name"] == r["Author Name"]:
            row["average_price"] = r["price"]
So I wonder how this should be done?
You can use transform and groupby to add a new column:
data['average_price'] = data.groupby('Author Name')['price'].transform('mean')
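A small self-contained sketch of what transform does here: it computes the mean per author and returns a result aligned with the original rows, so it can be assigned directly as a column. The prices below are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Made-up prices; two sales for one author, one sale for another.
data = pd.DataFrame({
    "Author Name": ["Agnes Cleve", "Agnes Cleve", "Albin Amelin"],
    "price": [100.0, 300.0, 50.0],
})

# One mean per author, repeated on every row belonging to that author.
data["average_price"] = data.groupby("Author Name")["price"].transform("mean")
print(data)
```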
Based on what you described, I think you should use the .join method of the pandas DataFrame. You don't need to create the 'average_price' column manually (if you already did, drop it first so the names don't clash). Note that join matches the "Author Name" column of data against the index of df, so df has to be indexed by author:
df = data.groupby('Author Name')['price'].mean().rename('average_price')
data = data.join(df, on="Author Name")
Now you can get the average price from the data['average_price'] column.
Hope this could help!
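I think the easiest way to do that is a merge (pandas.merge, the equivalent of a SQL join):
df_data = pd.DataFrame([...]) # your data here
# Rename the aggregated column so it does not clash with the original 'price'
df_agg_data = df_data.groupby('Author Name', as_index=False)['price'].mean().rename(columns={'price': 'average_price'})
df_data = df_data.merge(df_agg_data, on="Author Name")
print(df_data)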

Python ValueError: Can only compare identically-labeled Series objects

The two DataFrames I am comparing are of different sizes (they use the same kind of index, though), and I suppose that is why I am getting the error. Can you please suggest a way to get around that? I am looking for those rows in df2 whose user_id matches one of those in df1. Thanks, I appreciate your response.
import numpy as np
import pandas as pd

data = np.array([['user_id', 'comment', 'label'],
                 [100, 'RT #Dvillain_: #oomf should text me.', 0],
                 [100, 'Buy viagra', 1],
                 [101, '#nowplaying M.C. Shan - Juice Crew Law on', 0],
                 [101, 'Buy viagra two', 1]])
data2 = np.array([['user_id', 'comment', 'label'],
                  [100, 'First comment', 0],
                  [100, 'Buy viagra', 1],
                  [102, 'Buy viagra two', 1]])
df1 = pd.DataFrame(data=data[1:, 0:], columns=data[0, 0:])
df2 = pd.DataFrame(data=data2[1:, 0:], columns=data2[0, 0:])
df = df2[df2['user_id'] == df1['user_id']]  # raises the ValueError
You are looking for isin
df = df2[df2['user_id'].isin(df1['user_id'])]
df
Out[814]:
user_id comment label
0 100 First comment 0
1 100 Buy viagra 1
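To make the difference concrete, here is a minimal sketch with simplified frames (plain integer user_id values, and only the columns needed). The == operator compares two Series label by label and raises when their indexes differ, whereas isin only asks whether each value occurs anywhere in the other Series.

```python
import pandas as pd

df1 = pd.DataFrame({"user_id": [100, 100, 101, 101]})
df2 = pd.DataFrame({
    "user_id": [100, 100, 102],
    "comment": ["First comment", "Buy viagra", "Buy viagra two"],
})

# Element-wise comparison needs identical labels: 4 rows vs 3 rows fails.
error = None
try:
    df2["user_id"] == df1["user_id"]
except ValueError as exc:
    error = str(exc)
print(error)

# Membership test: keeps the rows of df2 whose user_id appears in df1.
matches = df2[df2["user_id"].isin(df1["user_id"])]
print(matches)
```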

Merge dataframe resulting in Series

I am working with the Texas Hospital Discharge Dataset and I am trying to determine the top 100 most frequent Principal Surgery Procedures over a period of 4 years.
To do this I need to go through each quarter of each year and count the procedures, but when I try to merge different quarters the result is a Series, not a DataFrame.
top_procedures = None
for year in range(6, 10):
    for quarter in range(1, 5):
        quarter_data = pd.read_table(
            filepath_or_buffer="/path/to/texas/data/PUDF_base"
            + str(quarter) + "q200" + str(year) + "_tab.txt",
        )
        quarter_data = quarter_data[quarter_data["THCIC_ID"] != 999999]
        quarter_data = quarter_data[quarter_data["THCIC_ID"] != 999998]
        quarter_procedures = quarter_data["PRINC_SURG_PROC_CODE"].value_counts()
        quarter_procedures = pd.DataFrame(
            {"PRINC_SURG_PROC_CODE": quarter_procedures.index, "count": quarter_procedures.values})
        top_procedures = quarter_procedures if (top_procedures is None) else \
            top_procedures.merge(
                right=quarter_procedures,
                how="outer",
                on="PRINC_SURG_PROC_CODE"
            ).set_index(
                ["PRINC_SURG_PROC_CODE"]
            ).sum(
                axis=1
            )
Could you please tell me what am I doing wrong? From the documentation it looks like it should return a DataFrame.
Cheers,
Dan
The merge does return a DataFrame, but your code then calls sum(axis=1) on the merged result. That adds up the values of all columns in each row into a single value per row, and a single column of values is a Series.
Hope that helps.
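One way around this, sketched here under the assumption that only the summed counts are needed, is to keep the running total as a Series on purpose and convert it to a DataFrame once at the end. The per-quarter counts below are made up and stand in for the value_counts() results; the names quarters, total and top are illustrative.

```python
import pandas as pd

# Hypothetical per-quarter value_counts() results (procedure code -> count).
quarters = [
    pd.Series({"A": 5, "B": 2}),
    pd.Series({"B": 4, "C": 1}),
]

total = None
for counts in quarters:
    # fill_value=0 treats a code missing from one side as a count of zero.
    total = counts if total is None else total.add(counts, fill_value=0)

# Most frequent codes (the question would use 100), as a two-column DataFrame.
top = (
    total.nlargest(2)
    .rename_axis("PRINC_SURG_PROC_CODE")
    .reset_index(name="count")
)
print(top)
```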
