I have a dataframe similar to the one shown below and I'm wondering how I can loop through it and calculate fitting parameters for every set number of days. For example, I would like to be able to input 30 days and get new constants for the first 30 days, then the first 60 days, and so on until the end of the date range.
ID date amount delta_t
1 2020/1/1 10.2 0
1 2020/1/2 11.2 1
2 2020/1/1 12.3 0
2 2020/1/2 13.3 1
I would like the parameters stored in another dataframe, which is what I am currently doing for the entire dataset, but that covers the whole time period rather than n-day blocks. Then, using the constants for each period, I will calculate the graph points and plot them.
Right now I am using groupby to group the wells by ID, then using the apply method to calculate the constants for each ID. This works for the entire dataframe, but the constants will change if I only use 30-day periods.
I don't know if there is a way in the apply method to do this more easily and output the constants either to a new column or to a separate dataframe with one row per ID. Any input is greatly appreciated.
import pandas as pd
from scipy.optimize import curve_fit

def parameters(x):
    # expo is the exponential model function defined elsewhere
    variables, _ = curve_fit(expo, x['delta_t'], x['amount'])
    return pd.Series({'param1': variables[0], 'param2': variables[1], 'param3': variables[2]})

param_series = df_filt.groupby('ID').apply(parameters)
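A rough sketch of one possible approach (not a tested solution; fit_expanding_windows and window_end_day are made-up names for illustration, edge handling is left out, and df_filt and expo are assumed from the question): filter on delta_t for each expanding window and reuse the same groupby/apply.

import pandas as pd

def fit_expanding_windows(df, window_days=30):
    # Fit the curve over the first 30, 60, 90, ... days for each ID (illustrative only).
    results = []
    last_day = int(df['delta_t'].max())
    for end_day in range(window_days, last_day + window_days, window_days):
        window = df[df['delta_t'] < end_day]
        params = window.groupby('ID').apply(parameters)  # reuses parameters() above
        params['window_end_day'] = end_day
        results.append(params)
    return pd.concat(results)

# param_blocks = fit_expanding_windows(df_filt, window_days=30)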
I have a dataframe that represents time series probabilities. Each value in column 'Single' represents the probability of that event in that time period (where each row represents one time period). Each value in column 'Cumulative' represents the probability of that event occurring in every time period up to that point (i.e. it is the product of every value in 'Single' from time 0 up to, but not including, the current period, as the example below shows).
A simplified version of the dataframe looks like this:
Single Cumulative
0 0.990000 1.000000
1 0.980000 0.990000
2 0.970000 0.970200
3 0.960000 0.941094
4 0.950000 0.903450
5 0.940000 0.858278
6 0.930000 0.806781
7 0.920000 0.750306
8 0.910000 0.690282
9 0.900000 0.628157
10 0.890000 0.565341
In order to calculate the 'Cumulative' column based on the 'Single' column I am looping through the dataframe like this:
for index, row in df.iterrows():
    df['Cumulative'][index] = df['Single'][:index].prod()
In reality there is a lot of data, and looping is a drag on performance. Is it at all possible to achieve this without looping?
I've tried to find a way to vectorize this calculation or even use the pandas.DataFrame.apply function, but I don't believe I'm able to reference the current index value in either of those methods.
There's a built-in function for this in pandas:
df['Single'].cumprod()
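Note that cumprod() includes the current row, while the 'Cumulative' column in the example is the product of all earlier rows only. A minimal sketch that reproduces the example's exact output, assuming that one-period offset is intentional:

import pandas as pd

df = pd.DataFrame({'Single': [0.99, 0.98, 0.97, 0.96]})
# cumprod() includes the current row; shifting by one reproduces the
# "product of every value before the current period" from the example
df['Cumulative'] = df['Single'].cumprod().shift(1, fill_value=1.0)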
I wish to round values in the DataFrame for display purposes only, when I use head() or tail(), but I want the DataFrame to retain the original values.
I tried using the round method but it changes the values in the original DataFrame. I don't wish to create a separate copy each time for this purpose.
Is there any other way than creating a separate copy?
I'm having trouble glancing at values because some columns are shown in scientific notation. I'd just like to see two or three decimal places at most and not keep deciphering exponents.
You can temporarily change the display option:
with pd.option_context('display.precision', 3):
    print(df.head())
0 1 2 3 4
0 -0.462 -0.698 -2.030 0.766 -1.670
1 0.925 0.603 -1.062 1.026 -0.096
2 0.589 0.819 -1.040 -0.162 2.467
3 -1.169 0.637 -0.435 0.584 1.232
4 -0.704 -0.623 1.226 0.507 0.507
Or change it permanently:
pd.set_option('display.precision', 3)
A simple print(df.head().round(3)) would also work in this case. None of these approaches change the DataFrame in place.
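As a quick illustration (the data and column names here are made up), the underlying values are left untouched by both approaches:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2) * 1e10, columns=['a', 'b'])

with pd.option_context('display.precision', 3):
    print(df.head())          # rounded for display only

print(df['a'].iloc[0])        # full-precision value is still there
# pd.reset_option('display.precision') restores the default after set_option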
I have a dataframe that looks like the one shown below.
I want to write a function that returns the last valid entry of each column: 30.35, 76.06, 1.53.
I can do this for one column at a time, but not for the entire dataframe:
DataFrame.loc[DataFrame[[('Price', 'A')]].last_valid_index()][[('Price', 'A')]][0]
Additionally, I want to take the difference of the last two entries per column, and the average of the last two entries per column. The unevenness of the dataframe (the trailing NaNs) is killing me. Also, I'm brand new to Python.
            Price
Security       A        B      C
Date
12/31/2016  60.5   76.0351   0.83
1/31/2017   59.5   75.7433  -0.01
2/28/2017   63.15  75.7181   0.25
3/31/2017   61.7   76.0605   1.53
4/30/2017   60.35      NaN    NaN
To return the last entry of each row, you can use list indexing as follows:
last_entries = []
for row in my_list:
    last_entries.append(row[-1])  # last element of each row in the nested list my_list
You can append the values of interest to other lists like this and use list indexing again to take differences or averages.
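Since the data is actually in a pandas DataFrame with trailing NaNs, a minimal sketch of the same idea in pandas (assuming the DataFrame is named df and has the MultiIndex Price columns shown above) might look like this:

import pandas as pd

# Last non-NaN value in each column
last_vals = df.apply(lambda col: col.loc[col.last_valid_index()])

# Difference and average of the last two non-NaN entries per column
last_two = df.apply(lambda col: col.dropna().iloc[-2:].reset_index(drop=True))
diff_last_two = last_two.iloc[1] - last_two.iloc[0]
mean_last_two = last_two.mean()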
I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, the first row in my result should be the average across all items of the first row in the dataframe for each item, the second result row should be the average across all items of the second observation for that item, and so on.
Stated another way, if we were to take all the date-ordered rows for each item and index them from i=1,2,...,n, I need the average across all items of the values of rows 1,2,...,n. That is, I want the average of the first observation for each item across all items, the average of the second observation across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? This would work, but is not leveraging the power of pandas whatsoever.
Adding some example data:
item_id  date        X      DUMMY_ROWS
20       2010-11-01  16759  0
         2010-12-01  16961  1
         2011-01-01  17126  2
         2011-02-01  17255  3
         2011-03-01  17400  4
         2011-04-01  17551  5
21       2007-09-01  4      6
         2007-10-01  5      7
         2007-11-01  6      8
         2007-12-01  10     9
22       2006-05-01  10     10
         2006-07-01  13     11
23       2006-05-01  2      12
24       2008-01-01  2      13
         2008-02-01  9      14
         2008-03-01  18     15
         2008-04-01  19     16
         2008-05-01  23     17
         2008-06-01  32     18
I've added a dummy rows column that does not exist in the data, for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
One option is to reset the index, then group by id.
import numpy as np

df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean)
This leaves your original df intact and gets you the mean across all months for each item id.
For your updated question (great example, by the way), I think the approach would be to add an "item_sequence_id". I've done this in the past with similar data.
df.sort_values(['item_id', 'date'], inplace=True)

def sequence_id(item):
    # number this item's observations 0, 1, 2, ... in date order
    item['seq_id'] = range(len(item))
    return item

df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)
The idea here is that the seq_id allows you to identify the position of the data point in time per item_id; assigning non-unique seq_id values to the items allows you to group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users regardless of their absolute time and user id.
Hopefully this is more of what you want.
Here's an alternative method I finally figured out (it assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:
def sequence_id(item):
    item['seq'] = range(len(item))
    return item

dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
Testing this on a 10,000 row subset of the data frame:
%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe, numbered from 0 to n-1 (where n is the number of rows in the frame). We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, but it's almost 4 times faster (I can't compare the methods on the full 13 million row dataframe, as the first method was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
dfWithSeqID_old.groupby('seq').agg(np.mean).head()
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
dfWithSeqID_new.mean(level=1).head()
The result is the same.
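For reference, a sketch of the same computation using only groupby, which sidesteps the level argument to mean() (it assumes the original ('item_id', 'date') MultiIndex is still in place and rows are date-ordered within each item):

# Position of each observation within its item, in date order
obs_position = df.groupby(level='item_id').cumcount()

# Mean of X across all items for the 1st, 2nd, ... observation
result = df.groupby(obs_position)['X'].mean()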