I need to calculate the cumulative (running) balance for each combination of two variables, for each day (starting with the creation day).
I have a column in the df containing the date of the transaction, one with the amount of the transaction, and the two variables to combine (country and payment method used).
To calculate the balance, I should use this formula:
Balance = transaction_type==1 - transaction_type==2 - transaction_type==3 + transaction_type==4
I cleaned up the data, but I don't know how to set up the calculation combined for the two variables (I was thinking of using groupby).
Can someone help me?
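A minimal sketch of one way to do this with groupby and cumsum. The column names (date, amount, transaction_type, country, payment_method) are assumptions for illustration, not taken from your data:

import pandas as pd

# Hypothetical column names: date, amount, transaction_type, country, payment_method
df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
    'amount': [100.0, 40.0, 25.0, 10.0],
    'transaction_type': [1, 2, 4, 3],
    'country': ['IT', 'IT', 'IT', 'FR'],
    'payment_method': ['card', 'card', 'card', 'cash'],
})

# Apply the sign convention from the formula: types 1 and 4 add, types 2 and 3 subtract
sign = df['transaction_type'].map({1: 1, 2: -1, 3: -1, 4: 1})
df['signed_amount'] = df['amount'] * sign

# Sum per day within each (country, payment_method) pair, then take the running total
balance = (df.groupby(['country', 'payment_method', df['date'].dt.date])['signed_amount']
             .sum()
             .groupby(level=['country', 'payment_method'])
             .cumsum()
             .rename('balance')
             .reset_index())
print(balance)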
I wanted to build a loop to cumulate daily log returns with a yearly reset to 100 on a specific date in January. First problem: I am working with different dataframes.
df = ETF Data and main dataframe for different parts of calculation
Maturity_Date_Option_1 = Dataframe with different maturity dates
-> If a date in df matches a date in Maturity_Date_Option_1, the series should reset to 100, and the cumulative daily return should be calculated onwards until the next match between the two dataframes. :)
I feel like I am near to the answer but missing something...
Hopefully you can help me with my problem. :)
for t in df.index:
    df['NDDUWI_cum_daily_returns'] = (df['NDDUWI_daily_return_log']
                                      + df['NDDUWI_cum_daily_returns'].shift(1))
    print(t)
    if t in Maturity_Date_Option_1.tolist():
        df['NDDUWI_cum_daily_returns'] = 100
    else:
        df['NDDUWI_cum_daily_returns']
Is there a more efficient way to structure this task? Sorry
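One vectorized way to structure it, as a sketch rather than a drop-in answer: treat every reset date as the start of a new segment and cumulate within each segment. This assumes df has a DatetimeIndex, that Maturity_Date_Option_1 is a series/array of dates (as the .tolist() in your loop suggests), and it keeps your additive convention of adding log returns to a base of 100.

import pandas as pd

reset_dates = pd.to_datetime(Maturity_Date_Option_1)

# Segment id that increases by 1 at every reset date
segment = df.index.isin(reset_dates).cumsum()

# Zero out the return on the reset day itself so that day sits exactly at 100,
# then accumulate the log returns within each segment
returns = df['NDDUWI_daily_return_log'].where(~df.index.isin(reset_dates), 0.0)
df['NDDUWI_cum_daily_returns'] = 100 + returns.groupby(segment).cumsum()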
Please help me with creating a dataframe.
There is a table containing data: id - user number, FirstVisit - date of the first visit, LastVisit - date of the last visit.
I need to create a table with fields: Month - month, MAU - number of unique users for this month.
So far I have created a list of months, using the minimum and maximum dates in the table:
df = pd.DataFrame(pd.date_range(visit.FirstVisit.min(), visit.LastVisit.max(), freq='M').strftime('%Y-%m'), columns=['Month'])
I came up with an approximate way to calculate MAU: the number of unique IDs per month that have FirstVisit <= Month and LastVisit >= Month.
Note that for a given user, FirstVisit and LastVisit are always the same, but a record can be repeated (the other columns are omitted here). So in fact, duplicates can be dropped and the remaining records simply counted.
I tried to do this with a function, but so far it does not work. Please help.
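One possible way to compute it, as a sketch: drop duplicates, build the month range, and count users whose visit window covers each month. It assumes FirstVisit and LastVisit are datetime columns in the visit table from the question.

import pandas as pd

# Drop repeated records first, as noted above
visits = visit.drop_duplicates(subset=['id', 'FirstVisit', 'LastVisit'])

# One entry per month between the earliest first visit and the latest last visit
months = pd.period_range(visits['FirstVisit'].min(), visits['LastVisit'].max(), freq='M')

# A user counts toward a month if FirstVisit <= that month and LastVisit >= that month
first = visits['FirstVisit'].dt.to_period('M')
last = visits['LastVisit'].dt.to_period('M')
mau = [((first <= m) & (last >= m)).sum() for m in months]

df = pd.DataFrame({'Month': months.strftime('%Y-%m'), 'MAU': mau})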
I calculate the number of quarters between two dates (the gap). Now, I want to test whether this gap is bigger than 2.
Thank you for your comments!
I'm actually running code from WRDS (Wharton Research Data Services). Below, fst_vint is a DataFrame with two date variables, rdate and lag_rdate. The first line seems to convert them to quarter variables (e.g., 9/8/2019 to 2019Q3) and then take the difference between them, storing it in a new column qtr.
fst_vint.qtr >= 2 creates a problem, because the former is a QuarterEnd object, while the latter is an integer. How do I deal with this problem?
fst_vint['qtr'] = (fst_vint['rdate'].dt.to_period('Q')
                   - fst_vint['lag_rdate'].dt.to_period('Q'))
# label first_report flag
fst_vint['first_report'] = ((fst_vint.qtr.isnull()) | (fst_vint.qtr>=2))
Converting the column to integers with .astype(int) and then using .diff() gives the desired answer. So the code in your case would be:
fst_vint['qtr'] = fst_vint['rdate'].astype(int).diff()
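Alternatively, if you want to keep the period-difference approach from the question, one option (a sketch, not part of the original answer) is to pull the integer quarter count out of the offset objects before comparing, since in recent pandas versions subtracting two quarterly Period columns yields DateOffset objects with an .n attribute:

import pandas as pd

qtr_offset = (fst_vint['rdate'].dt.to_period('Q')
              - fst_vint['lag_rdate'].dt.to_period('Q'))

# Extract the integer number of quarters; keep NaN where the lag is missing
fst_vint['qtr'] = qtr_offset.apply(lambda off: off.n if pd.notnull(off) else float('nan'))

# Now the comparison with an integer works
fst_vint['first_report'] = fst_vint['qtr'].isnull() | (fst_vint['qtr'] >= 2)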
I'm trying to calculate the weighted average of the "prices" column in the following dataframe for each zone, regardless of hour. I want to essentially sum the quantities that match A, divide each individual quantity row by that amount (to get the weights) and then multiply it by the price.
There are about 200 zones; I'm having a hard time writing something that will generically detect that the zones match, rather than having to write df['ZONE'] = 'A' etc. Please help my lost self =)
HOUR: 1,2,3,1,2,3,1,2,3
ZONE: A,A,A,B,B,B,C,C,C
PRICE: 12,15,16,17,12,11,12,13,15
QUANTITY: 5,6,1,5,7,9,6,3,2
I'm not sure if you can write something generic, but I thought: what if I wrote a function where x is my "Zone", created a list of possible zones, and then used a for loop? Here's the function I wrote; it doesn't really work, and I'm trying to figure out how else I can make it work:
def wavgp(x):
    df.loc[df['ZONE'].isin([str(x)])] = x
Here is a possible solution using a groupby operation:
weighted_price = df.groupby('ZONE').apply(lambda x: (x['PRICE'] * x['QUANTITY']).sum()/x['QUANTITY'].sum())
Explanation
First we group by ZONE; for each of these blocks (of the same zone) we multiply the price by the quantity and sum these values. We then divide this result by the sum of the quantities to get your desired result.
ZONE
A 13.833333
B 12.761905
C 12.818182
dtype: float64
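If the lambda-based apply ever feels slow across ~200 zones, an equivalent way (a sketch using the same column names) is to precompute the weighted column and divide two plain grouped sums:

# Same result without apply: sum of price*quantity per zone divided by sum of quantity per zone
tmp = df.assign(weighted=df['PRICE'] * df['QUANTITY'])
weighted_price = tmp.groupby('ZONE')['weighted'].sum() / tmp.groupby('ZONE')['QUANTITY'].sum()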
I have daily CSVs that are automatically created for work, averaging about 1000 rows and exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new txt file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns (-2000 to 300000) are profit/loss data based on time (milliseconds). The file is usually between 700 and 1000 lines, and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.groupby('provider')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format to write all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (i.e., total_tickets) and applying it to the next calculation (i.e., 1s Loss).
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby; there are more examples in the pandas documentation.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations; it just tells pandas how to find the groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to variables, e.g. total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
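Putting those pieces together for a few of the requested metrics (a sketch; 'provider', 'filled', and 'fill_size' are the column names described in the question):

import pandas as pd

grouped = df.groupby('provider')

total_tickets = grouped['filled'].sum()                # filled tickets per provider
share_pct = total_tickets / df['filled'].sum() * 100   # share of all filled tickets in the file
fill_rate = grouped['filled'].mean()                   # equals filled/(filled+rejected) if every row is either filled or rejected
size = grouped['fill_size'].sum()                      # total fill size per provider

summary = pd.DataFrame({'total_tickets': total_tickets,
                        'share_pct': share_pct,
                        'fill_rate': fill_rate,
                        'size': size})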
Update:
For one second loss (and for the other losses), you need two things:
The number of times, per provider, that df['1000'] < 0
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['fill'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
The same approach carries over to the 10s Loss question.
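For example (a sketch that assumes the millisecond columns from '1000' through '10000' are collected in a list called cols, as in the averaging step below):

# 10s Loss: how often the row-wise minimum over the 1s-10s columns is negative,
# divided by the per-provider record count computed above
_10s_loss_freq = grouped.apply(lambda x: (x[cols].min(axis=1) < 0).sum())
_10s_loss = _10s_loss_freq / records_per_group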
The last question, about the average over a range of columns, relies on pandas' understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean(), axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)