I have a dataframe that looks like the one below.
I want to write a function that returns the last valid entry of each column: 60.35, 76.06, 1.53.
I can do this for a single column, but not for the entire dataframe:
DataFrame.loc[DataFrame[[('Price', 'A')]].last_valid_index()][[('Price', 'A')]][0]
Additionally, I want to take the difference of the last two valid entries per column, and the average of the last two valid entries per column. The unevenness of the dataframe is killing me. Also, I'm brand new to Python.
              Price
Security          A        B     C
Date
12/31/2016     60.5  76.0351  0.83
1/31/2017      59.5  75.7433 -0.01
2/28/2017     63.15  75.7181  0.25
3/31/2017      61.7  76.0605  1.53
4/30/2017     60.35      NaN   NaN
To return the last entry of each row, you can use list indexing as follows:

last_entries = []
for row in my_list:                  # my_list holds the rows
    last_entries.append(row[-1])     # the last entry of this row

You can append the values of interest to other lists in the same way, and use list indexing again to compare them.
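For reference, pandas can also do this directly, without converting to lists. A minimal sketch, assuming the frame is built from the example data above (the column tuples and index labels are taken from the question):

import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([['Price'], ['A', 'B', 'C']], names=[None, 'Security'])
df = pd.DataFrame(
    [[60.5, 76.0351, 0.83],
     [59.5, 75.7433, -0.01],
     [63.15, 75.7181, 0.25],
     [61.7, 76.0605, 1.53],
     [60.35, np.nan, np.nan]],
    index=pd.Index(['12/31/2016', '1/31/2017', '2/28/2017', '3/31/2017', '4/30/2017'], name='Date'),
    columns=columns,
)

# Last non-NaN value in each column: 60.35, 76.0605, 1.53
last_valid = df.apply(lambda col: col.loc[col.last_valid_index()])
# Difference and average of the last two valid entries per column
diff_last_two = df.apply(lambda col: col.dropna().iloc[-1] - col.dropna().iloc[-2])
mean_last_two = df.apply(lambda col: col.dropna().iloc[-2:].mean())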
I have a df that looks like this:
Category  Number  Constant
One       141.2   271.01
One       57.4    271.01
One       51.3    271.01
Two       24.69   27.29
Two       12.72   27.29
Two       10.37   27.29
What I want is something that can iterate through each row and calculate a new value of the constant given the previous value of the constant. The resulting dataframe should look something like this:
Category  Number  Constant
One       141.2   129.99
One       57.4    72.59
One       51.3    21.29
Two       24.69   2.6
Two       12.72   -10.12
Two       10.37   -20.49
Update: the calculation is Constant - Number for the first row of each category, and then Constant[n-1] - Number[n] for the rest.
Is there a way to do this without using a for loop?
Use a groupby.cumsum to compute the cumulative sum and subtract this from "Constant":
df['Constant'] -= df.groupby('Category')['Number'].cumsum()
Alternatively, if you don't want an in-place operation:
df['New_Col'] = df['Constant'].sub(df.groupby('Category')['Number'].cumsum())
Output:
Category Number Constant
0 One 141.20 129.81
1 One 57.40 72.41
2 One 51.30 21.11
3 Two 24.69 2.60
4 Two 12.72 -10.12
5 Two 10.37 -20.49
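To see why this matches the recursion: unrolling Constant[n-1] - Number[n] gives the original constant minus the running sum of Number within each category, which is exactly what the cumsum expression computes. A self-contained reconstruction of the example, for checking:

import pandas as pd

df = pd.DataFrame({
    'Category': ['One', 'One', 'One', 'Two', 'Two', 'Two'],
    'Number':   [141.2, 57.4, 51.3, 24.69, 12.72, 10.37],
    'Constant': [271.01, 271.01, 271.01, 27.29, 27.29, 27.29],
})
df['Constant'] -= df.groupby('Category')['Number'].cumsum()
print(df)   # reproduces the output above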
I need to perform the following steps on a data-frame:
Assign a starting value to the "balance" attribute of the first row.
Calculate the "balance" values for the subsequent rows based on the value of the previous row, using a formula such as (previous row balance + 1).
I have tried the following steps:
Created the data-frame:
df = pd.DataFrame(pd.date_range(start = '2019-01-01', end = '2019-12-31'),columns = ['dt_id'])
Created attribute called 'balance':
df["balance"] = 0
Tried to conditionally update the data-frame:
df["balance"] = np.where(df.index == 0, 100, df["balance"].shift(1) + 1)
Results:
From what I can observe, the shifted values are read from the original (all-zero) column before the update is written back to the data-frame, so every row after the first just gets 1 instead of the running balance.
The desired output for "balance" attribute :
Row 0 : 100
Row 1: 101
Row 2 : 102
And so on
If I understand correctly, if you add this line of code after yours, you are done:
df["balance"].cumsum()
0 100.0
1 101.0
2 102.0
3 103.0
4 104.0
...
360 460.0
361 461.0
362 462.0
363 463.0
364 464.0
It is a cumulative sum: each value is added to the sum of the values before it, and since your column holds the starting value followed by ones, it produces exactly what you want.
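For completeness, a sketch of the full sequence with the result assigned back to the column (the assignment is the step that is easy to miss):

import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range(start='2019-01-01', end='2019-12-31'), columns=['dt_id'])
df["balance"] = 0
# 100 in the first row, 1 everywhere else ...
df["balance"] = np.where(df.index == 0, 100, df["balance"].shift(1) + 1)
# ... so the running sum gives 100, 101, 102, ...
df["balance"] = df["balance"].cumsum()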
The problem you have is that you want to calculate an array whose elements depend on each other: element 2 depends on element 1, element 3 depends on element 2, and so on.
Whether there is a simple solution depends on the formula you use, i.e., on whether you can vectorize it. Here is a good explanation of that topic: Is it possible to vectorize recursive calculation of a NumPy array where each element depends on the previous one?
In your case a simple loop should do it:
balance = np.empty(len(df.index))
balance[0] = 100
for i in range(1, len(df.index)):
    balance[i] = balance[i-1] + 1  # or whatever formula you want to use
Please note that the above is the general solution. Your particular formula can be vectorized, so the array can also be generated with:
balance = 100 + np.arange(0, len(df.index))
I have a dataframe that represents time series probabilities. Each value in column 'Single' represents the probability of that event in that time period (where each row represents one time period). Each value in column 'Cumulative' represents the probability of that event occurring every time period until that point (ie it is the product of every value in 'Single' from time 0 until now).
A simplified version of the dataframe looks like this:
Single Cumulative
0 0.990000 1.000000
1 0.980000 0.990000
2 0.970000 0.970200
3 0.960000 0.941094
4 0.950000 0.903450
5 0.940000 0.858278
6 0.930000 0.806781
7 0.920000 0.750306
8 0.910000 0.690282
9 0.900000 0.628157
10 0.890000 0.565341
In order to calculate the 'Cumulative' column based on the 'Single' column I am looping through the dataframe like this:
for index, row in df.iterrows():
    df['Cumulative'][index] = df['Single'][:index].prod()
In reality there is a lot of data, and looping is a drag on performance. Is it at all possible to achieve this without looping?
I've tried to find a way to vectorize this calculation or even use the pandas.DataFrame.apply function, but I don't believe I'm able to reference the current index value in either of those methods.
There's a built-in function for this in Pandas:
df.cumprod()
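One caveat, judging from the example data: the 'Cumulative' column starts at 1.0 and excludes the current period, so the cumulative product needs to be shifted down one row. A minimal sketch on a small frame like the one in the question:

import pandas as pd

df = pd.DataFrame({'Single': [0.99, 0.98, 0.97, 0.96, 0.95]})
# Row n gets the product of 'Single' over rows 0..n-1; the first row has
# nothing before it, so it is filled with 1.0.
df['Cumulative'] = df['Single'].cumprod().shift(1).fillna(1.0)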
I have two tables, Table 1 and Table 2. Table 1 has one column and Table 2 has two columns. I am giving below an example of my two tables to further explain what I am trying to do.
TABLE 1      TABLE 2
A            B      C
0.015        0.000  14.0    # The BINS are 0.00-0.01 = 14.0
0.033        0.025  14.5    # 0.01-0.02 = 14.5
0.042        0.050  15.0    # 0.02-0.03 = 15.0
0.501        0.075  15.5    # 0.03-0.04 = 15.5, AND SO ON
0.505        0.100  16.0
0.520        0.125  16.5
0.350        0.150  17.0
Here we take the BINS from column B, i.e. 0.0 to 0.01, 0.01 to 0.02, and so on.
I would like to select column A in Table 1, take the first value (0.015), find out the range (BIN) in which it lies (we can see that it lies between 0.000 and 0.025), then add a second column to Table 1 and give it the value 14.5 (the second BIN from Table 2).
I would like to repeat the same for the second value of Table 1, i.e. 0.033: we can see it lies between 0.025 and 0.050, so we give it the value 15.5 (from Table 2), and so on.
The problem is, the only way I know to iterate is using for loops:
for a in A:  # takes the values of column A in Table 1
But I don't know how to proceed from there, i.e. how do I check which BIN of column B my column A value lies in, so that I can give it the corresponding value from column C?
You can iterate through a list using for i, x in enumerate(X). This gives you both the element of the list and the index of that element. You could also use for i in range(len(X)), since in your case you may need to do a look-ahead. Maybe this will work for a solution with arbitrary bin sizes:
A2 = []
for a in A:
    for i in range(len(B)-1):
        if a < B[i+1]:
            A2.append(C[i])
            break
    else:  # We never broke out
        A2.append(C[-1])
We compare each element in A to progressively greater elements in B. If the element a is less than the value of a list element in B, then it belongs in the previous bin (i.e. 0.015 from A is less than 0.025 in B and thus belongs in the previous bin). A breakdown, since you asked:
A2 = [] # Make a new list
for a in A: # Do the below once for every element in A
    for i in range(len(B)-1):
Instead of iterating directly over B, we're looping through the possible indexes (which start at 0 and end at len(B)-1). However, we're actually going one less than that. If you use range(10), you end up with 0...9. So if you want to iterate over all of B, you can just use range(len(B)). But we actually want to go one less than the full length of B, because in the next step, we're looking ahead.
        if a < B[i+1]:
Here we're looking one list index ahead, to see if a is less than the B element at index i+1. If it is, then we want to find the element of C that corresponds to the previous index, i.e. index i. For example, given 0.015 from list A, we look at 0.025 from B. 0.015 < 0.025, so that means 0.015 belongs in the previous bin. That's why we're looking ahead by one.
            A2.append(C[i])
            break
Grab the element of C that corresponds to index i (no longer looking ahead, since we know i is the correct bin as i+1 is too large) and toss it into A2. Then break out of the inner for loop and start again with the next element of A.
    else:  # We never broke out
        A2.append(C[-1])
This else statement executes if we never break out of the for loop. In this case, a can only possibly be in the final bin, so we just grab the element from C that's at the end of the list (which [-1] will do automatically).
You can just multiply a by 40 (since the bins in column B are 0.025 wide), convert it to int, and use that as the index into Table 2.
For example, take the first value (0.015), multiply it by 40 (0.6), convert it to int (0), and you have the index into Table 2 that you want.
D = list()
for a in A:
    index = int(a * 40)                       # 1 / 0.025 = 40, so a*40 gives the bin index
    try:
        corresponding_value_from_c = C[index]
    except IndexError:                        # value lies beyond the last bin
        corresponding_value_from_c = C[-1]
    D.append(corresponding_value_from_c)
At the end, D will be the column containing all the values that you need.
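For what it's worth, the same bin lookup can be done without an explicit Python loop. A sketch using numpy.searchsorted, mirroring the loop above (each value a gets C[i] where B[i] <= a < B[i+1], and values past the last edge fall into the final bin); A, B and C are taken from the example tables:

import numpy as np

A = np.array([0.015, 0.033, 0.042, 0.501, 0.505, 0.520, 0.350])
B = np.array([0.000, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150])
C = np.array([14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.0])

# For each a, the index of the first bin edge strictly greater than a, minus one,
# is the bin it falls into; clip pushes values past the last edge into the final bin.
idx = np.searchsorted(B, A, side='right') - 1
idx = np.clip(idx, 0, len(C) - 1)
A2 = C[idx]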
I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, the first row in my result should be the average across all items of the first row in the dataframe for each item, the second result row should be the average across all items of the second observation for that item, and so on.
Stated another way, if we were to take all the date-ordered rows for each item and index them from i=1,2,...,n, I need the average across all items of the values of rows 1,2,...,n. That is, I want the average of the first observation for each item across all items, the average of the second observation across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? This would work, but is not leveraging the power of pandas whatsoever.
Adding some example data:
item_id  date        X      DUMMY_ROWS
20       2010-11-01  16759   0
         2010-12-01  16961   1
         2011-01-01  17126   2
         2011-02-01  17255   3
         2011-03-01  17400   4
         2011-04-01  17551   5
21       2007-09-01      4   6
         2007-10-01      5   7
         2007-11-01      6   8
         2007-12-01     10   9
22       2006-05-01     10  10
         2006-07-01     13  11
23       2006-05-01      2  12
24       2008-01-01      2  13
         2008-02-01      9  14
         2008-03-01     18  15
         2008-04-01     19  16
         2008-05-01     23  17
         2008-06-01     32  18
I've added a dummy rows column that does not exist in the data, for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
One option is to reset the index then group by id.
df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean)
This leaves your original df intact and gets you the mean across all months for each item id.
For your updated question (great example, by the way), I think the approach would be to add an "item_sequence_id"; I've done this in the past with similar data.
df.sort_values(['item_id', 'date'], inplace=True)
def sequence_id(item):
    item['seq_id'] = range(len(item))   # 0, 1, 2, ... for each row of this item
    return item
df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)
The idea here is that the seq_id identifies the position of each data point in time per item_id; assigning non-unique seq_id values across items allows you to group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users, regardless of their absolute time and user id.
Hopefully this is more of what you want.
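As a side note, on more recent pandas versions the same sequence id can presumably be built without apply, using groupby.cumcount. A sketch, assuming item_id is an index level and the rows within each item are already date-ordered, as in the example:

# Number the observations 0, 1, 2, ... within each item_id group.
df['seq_id'] = df.groupby(level='item_id').cumcount()
# Average the n-th observation of X across all items.
result = df.groupby('seq_id')['X'].mean()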
Here's an alternative method for this that I finally figured out (which assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:
def sequence_id(item):
    item['seq'] = range(0, len(item), 1)
    return item
dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
Testing this on a 10,000 row subset of the data frame (referred to as df below):
%timeit -n10 dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe, numbered from 0 to n-1 (where n is the number of rows in the frame). We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, BUT it's almost 4 times faster (I can't compare the methods on the full 13 million row dataframe, as the first method was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
dfWithSeqID_old.groupby('seq').agg(np.mean).head()
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
dfWithSeqID_new.mean(level=1).head()
The result is the same.
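One note for newer pandas versions, where the level= argument to mean() has been removed: the equivalent of the last step would presumably be an explicit groupby on the index level:

dfWithSeqID_new.groupby(level=1).mean().head()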