Indexing by row counts in a pandas dataframe - python

I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, if we take the date-ordered rows for each item and index them i = 1, 2, ..., n, the first row of my result should be the average of row 1 across all items, the second result row should be the average of row 2 across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? This would work, but is not leveraging the power of pandas whatsoever.
Adding some example data:
item_id  date        X      DUMMY_ROWS
20       2010-11-01  16759  0
         2010-12-01  16961  1
         2011-01-01  17126  2
         2011-02-01  17255  3
         2011-03-01  17400  4
         2011-04-01  17551  5
21       2007-09-01  4      6
         2007-10-01  5      7
         2007-11-01  6      8
         2007-12-01  10     9
22       2006-05-01  10     10
         2006-07-01  13     11
23       2006-05-01  2      12
24       2008-01-01  2      13
         2008-02-01  9      14
         2008-03-01  18     15
         2008-04-01  19     16
         2008-05-01  23     17
         2008-06-01  32     18
I've added a dummy rows column that does not exist in the data, for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.

One option is to reset the index then group by id.
df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean)
This leaves your original df intact and gets you the mean across all months for each item id.
For your updated question (great example, by the way) I think the approach would be to add an "item_sequence_id" column. I've done this in the past with similar data.
df.sort_values(['item_id', 'date'], inplace=True)

def sequence_id(item):
    # number each item's observations 0, 1, 2, ... in date order
    item['seq_id'] = range(len(item))
    return item

df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)
The idea here is that the seq_id lets you identify each data point's position in time per item_id; assigning non-unique seq_id values to the items then allows you to group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users regardless of their absolute time and user id.
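In more recent pandas the same per-item sequence can also be generated directly with GroupBy.cumcount(); a minimal sketch, assuming item_id and date are ordinary columns as in the sorted frame above:

df = df.sort_values(['item_id', 'date'])
df['seq_id'] = df.groupby('item_id').cumcount()  # 0, 1, 2, ... per item, in date order
df.groupby('seq_id')['X'].mean()                 # mean of the n-th observation across all items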
Hopefully this is more of what you want.

Here's an alternative method I finally figured out for this (which assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:
def sequence_id(item):
    item['seq'] = range(len(item))
    return item

dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)

Testing this on shrink, a 10,000-row subset of the data frame:

%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe, numbered from 0 to n-1, where n is the number of rows in the frame. We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, but it's almost 4 times faster (I can't compare the methods on the full 13-million-row dataframe, as the first method was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = shrink.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
dfWithSeqID_old.groupby('seq').agg(np.mean).head()
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
dfWithSeqID_new.mean(level=1).head()
The result is the same.
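A side note for anyone on a newer pandas version: the level argument to mean() has since been removed, so the equivalent there (a small sketch) is a groupby on the index level:

dfWithSeqID_new.groupby(level=1).mean().head()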

Related

Printing the whole row of my data from a max value in a column

I am trying to select the highest value from this data, but I also need the month it comes from, i.e. printing the whole row. Currently I'm using df.max(), which just pulls the highest value. Does anyone know how to do this in pandas?
#current code
accidents["month"] = accidents.Date.apply(lambda s: int(s.split("/")[1]))
temp = accidents.groupby('month').size().rename('Accidents')
#selecting the highest value from the dataframe
temp.max()
Answer given: 10937
The answer I need should look like this (month and number of accidents): 11 10937
temp looks like this:
month
1 9371
2 8838
3 9427
4 8899
5 9758
6 9942
7 10325
8 9534
9 10222
10 10311
11 10937
12 9972
Name: Accidents, dtype: int64
It would also be good to rename the accidents column to 'Accidents', if anyone can help with that too. Thanks.
If the value is unique (in your case it is) you can simply take a subset of the Series.

temp[temp == temp.max()]

The code compares every value in temp with its maximum and keeps only the matching entry, so the result shows both the month (the index) and the number of accidents.
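If you only need the single maximum, idxmax() gives the index label (the month) of the maximum directly; a small sketch, assuming temp is the Series shown above:

best_month = temp.idxmax()   # 11, the month with the most accidents
most_accidents = temp.max()  # 10937
print(best_month, most_accidents)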

Wrangling shifted DataFrame with Pandas

In the following pandas DataFrame, the first two columns (Remessas_A and Remessas_A_1d) were given, and I had to find the third (previsao) following the pattern described below. Note that I'm not counting the DataEntrega column, which is the datetime index.
DataEntrega,Remessas_A,Remessas_A_1d,previsao
2020-07-25,696.0,,
2020-07-26,0.0,,
2020-07-27,518.0,,
2020-07-28,629.0,,
2020-07-29,699.0,,
2020-07-30,660.0,,
2020-07-31,712.0,,
2020-08-01,2.0,-672.348684948797,23.651315051203028
2020-08-02,0.0,-504.2138715410994,-504.2138715410994
2020-08-03,4.0,-91.10009092298037,426.89990907701963
2020-08-04,327.0,194.46620611760167,823.4662061176017
2020-08-05,442.0,220.65451760630847,919.6545176063084
2020-08-06,474.0,-886.140302693952,-226.14030269395198
2020-08-07,506.0,-61.28132269808316,650.7186773019168
2020-08-08,11.0,207.12286256242962,230.77417761363265
2020-08-09,2.0,109.36137834671834,-394.85249319438105
2020-08-10,388.0,146.2428764085755,573.1427854855951
2020-08-11,523.0,-193.02046115081606,630.4457449667857
2020-08-12,509.0,-358.59415822684485,561.0603593794635
2020-08-13,624.0,966.9258406162757,740.7855379223237
2020-08-14,560.0,175.8273195122506,826.5459968141674
2020-08-15,70.0,19.337299248463978,250.11147686209662
2020-08-16,3.0,83.09413535361391,-311.75835784076713
2020-08-17,401.0,-84.67345026550751,488.4693352200876
2020-08-18,526.0,158.53310638454195,788.9788513513276
2020-08-19,580.0,285.99137337700336,847.0517327564669
2020-08-20,624.0,-480.93226226400344,259.85327565832023
2020-08-21,603.0,-194.68412031046182,631.8618765037056
2020-08-22,45.0,-39.23172496101115,210.87975190108546
2020-08-23,2.0,-115.26376570266325,-427.0221235434304
2020-08-24,463.0,10.04635376084557,498.5156889809332
2020-08-25,496.0,-32.44638720124206,756.5324641500856
2020-08-26,600.0,-198.6715680014182,648.3801647550487
2020-08-27,663.0,210.40991269713578,470.263188355456
2020-08-28,628.0,40.32391720053602,672.1857937042416
2020-08-29,380.0,-2.4418918145294626,208.437860086556
2020-08-30,0.0,152.66166068424076,-274.3604628591896
2020-08-31,407.0,18.499558564880928,517.0152475458141
The first 7 values of Remessas_A_1d and previsao are null, and will be kept null.
To obtain the first 7 non-null values of previsao, from 2020-08-01 to 2020-08-07, I shifted Remessas_A 7 days ahead and added the shifted Remessas_A to the original Remessas_A_1d:
#res is the name of the dataframe
res['previsao'].loc['2020-08-01':'2020-08-07'] = res['Remessas_A'].shift(7).loc['2020-08-01':'2020-08-07'].add(res['Remessas_A_1d'].loc['2020-08-01':'2020-08-07'])
To find the next 7 values of previsao, from 2020-08-08 to 2020-08-14, I shifted the previsao column itself 7 days ahead and added the shifted previsao to the original Remessas_A_1d:
res['previsao'].loc['2020-08-08':'2020-08-14'] = res['previsao'].shift(7).loc['2020-08-08':'2020-08-14'].add(res['Remessas_A_1d'].loc['2020-08-08':'2020-08-14'])
To find the next values of previsao, I repeated the last step, moving 7 days ahead each time:
res['previsao'].loc['2020-08-15':'2020-08-21'] = res['previsao'].shift(7).loc['2020-08-15':'2020-08-21'].add(res['Remessas_A_1d'].loc['2020-08-15':'2020-08-21'])
res['previsao'].loc['2020-08-22':'2020-08-28'] = res['previsao'].shift(7).loc['2020-08-22':'2020-08-28'].add(res['Remessas_A_1d'].loc['2020-08-22':'2020-08-28'])
res['previsao'].loc['2020-08-29':'2020-08-31'] = res['previsao'].shift(7).loc['2020-08-29':'2020-08-31'].add(res['Remessas_A_1d'].loc['2020-08-29':'2020-08-31'])
#the last line only spanned 3 days because I reached the end of my dataframe
Instead of doing that by hand, how can I create a function that would take periods=7, Remessas_A and Remessas_A_1d as input and would give previsao as the output?
Not the most elegant code, but this should do the trick:
df["previsao"][df.index <= pd.to_datetime("2020-08-07")] = df["Remessas_A"].shift(7) + df["Remessas_A_1d"]
for d in pd.date_range("2020-08-08", "2020-08-31"):
df.loc[d, "previsao"] = df.loc[d - pd.Timedelta("7d"), "previsao"] + df.loc[d, "Remessas_A_1d"]
Edit: I've assumed you have DataEntrega as the index, as a datetime object. I can post the rest of the code if you need it.
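To address the "wrap it in a function" part of the question, here is a rough sketch (build_previsao is a hypothetical name) that generalizes the same block-by-block recurrence to an arbitrary number of periods:

def build_previsao(df, periods=7):
    """Compute 'previsao' block by block, each block `periods` rows long.

    Assumes a datetime-indexed df with columns 'Remessas_A' and 'Remessas_A_1d',
    where the first `periods` values of 'Remessas_A_1d' are NaN and stay NaN.
    """
    out = df.copy()
    # Seed every row from the shifted Remessas_A; only the first non-null block
    # keeps this value, later blocks are overwritten below.
    out['previsao'] = out['Remessas_A'].shift(periods) + out['Remessas_A_1d']
    # Each later block adds Remessas_A_1d to the previsao computed `periods` rows earlier.
    for start in range(2 * periods, len(out), periods):
        block = out.index[start:start + periods]
        out.loc[block, 'previsao'] = (
            out['previsao'].shift(periods).loc[block] + out.loc[block, 'Remessas_A_1d']
        )
    return out['previsao']

res['previsao'] = build_previsao(res, periods=7)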

pandas data frame, apply t-test on rows simultaneously grouping by column names (have duplicates!)

I have a data frame with particular readouts as indexes (different types of measurements for a given sample); each column is a sample for which these readouts were taken. I also have a treatment group assigned as the column name for each sample. You can see the example below.
What I need to do: for a given readout (row), group samples by treatment (column name) and perform a Welch's t-test on each group (each treatment). The t-test must be done as a comparison with one fixed treatment (the control treatment). I do not care about tracking the sample ids (they were there before; I dropped them on purpose), and I'm not going to do paired tests.
For example here, for readout1 I need to compare treatment1 vs treatment3 and treatment2 vs treatment3 (it's OK if I also get treatment3 vs treatment3).
Example of data frame:
frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1', 'treatment1', 'treatment1',
                              'treatment2', 'treatment2', 'treatment2',
                              'treatment3', 'treatment3', 'treatment3'])
frame
Out[757]:
treatment1 treatment1 ... treatment3 treatment3
readout1 0 1 ... 7 8
readout2 9 10 ... 16 17
readout3 18 19 ... 25 26
[3 rows x 9 columns]
I've been fighting with this for several days now. I tried unstacking/stacking the data, transposing the data frame, then grouping by index, removing NaNs and applying a lambda. I tried other strategies, but none worked. I will appreciate any help.
Thank you!
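One possible direction, as a rough sketch (assuming scipy is available and taking treatment3 as the fixed control): group each row's values by column name and run Welch's t-test (equal_var=False) of every treatment against the control.

import numpy as np
import pandas as pd
from scipy import stats

frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1'] * 3 + ['treatment2'] * 3 + ['treatment3'] * 3)

control = 'treatment3'  # the fixed control treatment

pvalues = {}
for readout, row in frame.iterrows():
    groups = row.groupby(level=0)        # group this row's sample values by treatment name
    control_values = groups.get_group(control)
    pvalues[readout] = {
        treatment: stats.ttest_ind(values, control_values, equal_var=False).pvalue
        for treatment, values in groups  # Welch's t-test of each treatment vs the control
    }

result = pd.DataFrame(pvalues).T         # rows: readouts, columns: treatments
print(result)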

Calculate a value in Pandas that is based on a product of past values without looping

I have a dataframe that represents time series probabilities. Each value in column 'Single' represents the probability of that event in that time period (where each row represents one time period). Each value in column 'Cumulative' represents the probability of that event occurring in every time period up to (but not including) that point (i.e. it is the product of every preceding value in 'Single').
A simplified version of the dataframe looks like this:
Single Cumulative
0 0.990000 1.000000
1 0.980000 0.990000
2 0.970000 0.970200
3 0.960000 0.941094
4 0.950000 0.903450
5 0.940000 0.858278
6 0.930000 0.806781
7 0.920000 0.750306
8 0.910000 0.690282
9 0.900000 0.628157
10 0.890000 0.565341
In order to calculate the 'Cumulative' column based on the 'Single' column I am looping through the dataframe like this:
for index, row in df.iterrows():
    df['Cumulative'][index] = df['Single'][:index].prod()
In reality, there is a lot of data and looping is a drag on performance. Is it at all possible to achieve this without looping?
I've tried to find a way to vectorize this calculation or even use the pandas.DataFrame.apply function, but I don't believe I'm able to reference the current index value in either of those methods.
There's a built-in function for this in pandas:

df['Single'].cumprod()
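Note that cumprod() includes the current row in the running product, while the Cumulative column above excludes it (its first value is 1.000000). A minimal sketch that reproduces the table exactly, using the values shown:

import pandas as pd

df = pd.DataFrame({'Single': [0.99, 0.98, 0.97, 0.96, 0.95, 0.94,
                              0.93, 0.92, 0.91, 0.90, 0.89]})

# Running product of Single, shifted down one row so each value covers only
# the periods before the current one; the first period gets the empty product 1.0.
df['Cumulative'] = df['Single'].cumprod().shift(1).fillna(1.0)
print(df)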

retrieve Last and 3rd to last row in uneven dataframe

I have a dataframe that looks like the attached picture.
I want to write a function that returns the last valid entry of each column: 60.35, 76.06, 1.53.
I can do this for each column, but not for the entire dataframe:
DataFrame.loc[DataFrame[[('Price', 'A')]].last_valid_index()][[('Price', 'A')]][0]
Additionally, I want to take the difference of the last two entries per column, and the average of the last two entries per column. The unevenness of the dataframe is killing me. Also, I'm brand new to Python.
             Price
Security         A        B      C
Date
12/31/2016    60.5  76.0351   0.83
1/31/2017     59.5  75.7433  -0.01
2/28/2017    63.15  75.7181   0.25
3/31/2017     61.7  76.0605   1.53
4/30/2017    60.35      NaN    NaN
To return the last entry of each row, you can use list indexing as follows:

last_entries = []
for row in my_list:
    last_entries.append(row[-1])  # the last element of each row in my_list

You can append the concerned values to other lists and use list indexing again to compare them.
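For the pandas side of the question, here is a sketch that reconstructs the pictured frame from the table above; dropping the trailing NaNs per column makes "last" mean "last valid observation", after which the last value, the difference of the last two, and their average all come from simple slicing:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {('Price', 'A'): [60.5, 59.5, 63.15, 61.7, 60.35],
     ('Price', 'B'): [76.0351, 75.7433, 75.7181, 76.0605, np.nan],
     ('Price', 'C'): [0.83, -0.01, 0.25, 1.53, np.nan]},
    index=pd.to_datetime(['2016-12-31', '2017-01-31', '2017-02-28',
                          '2017-03-31', '2017-04-30']))

# Keep only the last two valid observations of each column, position-aligned.
last_two = df.apply(lambda col: col.dropna().iloc[-2:].reset_index(drop=True))

last_value = last_two.iloc[-1]                         # 60.35, 76.0605, 1.53
diff_last_two = last_two.iloc[-1] - last_two.iloc[-2]  # difference of the last two entries
mean_last_two = last_two.mean()                        # average of the last two entries
print(last_value, diff_last_two, mean_last_two, sep='\n\n')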
