How to calculate averages and SEM in a multi-indexed pandas dataframe? - python

I have some data in a pandas dataframe which has a triple multi-index:
Antibody         Time  Repeats
Customer_Col1A2  0     1          0.657532
                       2          0.639933
                       3          0.975302
                 5     1          0.628196
                       2          0.663301
                       3          0.921025
                 10    1          0.665601
                       2          0.785324
                       3          0.697913
My question is: what is the best way to calculate the average and the sample standard deviation (from which the standard error of the mean follows) for this data, grouped by time point? So the answer for the 0 time point would be (0.657532 + 0.639933 + 0.975302)/3 = 0.757589 for the average and 0.188750216 for the sample SD. The output would look something like this:
Antibody         Time  Average   Sample SD
Customer_Col1A2  0     0.757589  0.188750216
                 5     ...       ...
                 10    ...       ...
Thanks in advance

You can group by levels of the multi-index by specifying the level parameter, and calculate the average and SD with the mean and std aggregations (pandas also has a built-in sem aggregation for the standard error of the mean):
df1.groupby(level=[0, 1]).agg(['mean', 'std'])
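For reference, a minimal runnable sketch of this, assuming the values live in an unnamed Series carrying the three-level index from the question:

import pandas as pd

# Rebuild the example data as a Series with a three-level MultiIndex
index = pd.MultiIndex.from_product(
    [['Customer_Col1A2'], [0, 5, 10], [1, 2, 3]],
    names=['Antibody', 'Time', 'Repeats'],
)
data = pd.Series(
    [0.657532, 0.639933, 0.975302,
     0.628196, 0.663301, 0.921025,
     0.665601, 0.785324, 0.697913],
    index=index,
)

# Collapse the Repeats level: mean, sample SD, and SEM per (Antibody, Time)
result = data.groupby(level=['Antibody', 'Time']).agg(['mean', 'std', 'sem'])
print(result)
# The Time 0 row gives mean 0.757589 and std 0.188750, matching the hand calculation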

Related

Take average of window in pandas

I have a large pandas dataframe. I want to average the first 12 rows, then the next 12 rows, and so on. I wrote a for loop for this task:
df_list = []
for i in range(0, len(df), 12):
    print(i, i + 12)
    df_list.append(df.iloc[i:i + 12].mean())
pd.concat(df_list, axis=1).T
Is there an efficient way to do this without a for loop?
You can floor-divide the index by N, i.e. 12 in your case, then group the dataframe by the quotient, and finally call mean on these groups:
# Random dataframe of shape 120,4
>>> df=pd.DataFrame(np.random.randint(10,100,(120,4)), columns=list('ABCD'))
>>> df.groupby(df.index//12).mean()
A B C D
0 49.416667 52.583333 63.833333 47.833333
1 60.166667 61.666667 53.750000 34.583333
2 49.916667 54.500000 50.583333 64.750000
3 51.333333 51.333333 56.333333 60.916667
4 51.250000 51.166667 50.750000 50.333333
5 56.333333 50.916667 51.416667 59.750000
6 53.750000 57.000000 45.916667 59.250000
7 48.583333 59.750000 49.250000 50.750000
8 53.750000 48.750000 51.583333 68.000000
9 54.916667 48.916667 57.833333 43.333333
I believe you want to split your dataframe into separate chunks of 12 rows. Then you can use np.arange inside groupby to take the mean of each separate chunk:
df.groupby(np.arange(len(df)) // 12).mean()
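Both answers produce the same grouping. As a cross-check, here is a sketch of the same computation with plain numpy reshaping, assuming len(df) is an exact multiple of 12:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 100, (120, 4)), columns=list('ABCD'))

# Reshape to (n_chunks, 12, n_cols) and average over the middle axis
means = df.to_numpy().reshape(-1, 12, df.shape[1]).mean(axis=1)
result = pd.DataFrame(means, columns=df.columns)
print(result)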

Printing an output using pandas .groupby to include keys that equalled to 0?

I'm trying to get an output that includes every key, even when its count is 0.
import pandas as pd
df = pd.read_csv('climate_data_Dec2017.csv')
wind_direction = df['Direction of maximum wind gust']
is_on_a_specific_day = df['Date'].str.contains("12-26")
specific_day = df[is_on_a_specific_day]
grouped_by_date = specific_day.groupby('Direction of maximum wind gust')
number_record_by_date = grouped_by_date.size()
print(number_record_by_date)
The current output looks like this:
E 4
ENE 2
ESE 1
NE 1
NNE 1
NNW 1
SE 3
SSE 3
SW 1
But I'm trying to get it to include the other directions too, i.e.
E 4
ENE 2
ESE 1
N 0
NE 1
NNE 1
NNW 1
NW 0
S 0
SE 3
SSE 3
SW 1
...
Is there any way to get my code to include it? I tried to group it by the wind direction dataframe rather than the specific_day dataframe, but going down that route, I'm stuck on what to do next. Any pointers would be great! Thanks
Probably, you need something like this:
df['is_on_a_specific_day'] = df['Date'].str.contains("12-26")
df.groupby('Direction of maximum wind gust').sum()[['is_on_a_specific_day']]
What you can do is:

1. Compute a list of all the unique values of the column 'Direction of maximum wind gust' in the original dataset: list_all_dirs = df['Direction of maximum wind gust'].unique()
2. Filter the dataset and compute the groupby as you did.
3. Append to the result one row for each value in the list that is not already there, by building a series like this and appending it to the series computed in the previous step (see the sketch below):
series_to_append = pd.Series({dir: 0 for dir in list_all_dirs if dir not in number_record_by_date.index}, name='Direction of maximum wind gust')
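A minimal sketch of the same idea, using reindex to add the missing directions in one step (variable and column names follow the question; the CSV layout is assumed):

import pandas as pd

df = pd.read_csv('climate_data_Dec2017.csv')

# All directions observed anywhere in the dataset
all_dirs = sorted(df['Direction of maximum wind gust'].dropna().unique())

# Counts for the specific day only
specific_day = df[df['Date'].str.contains('12-26')]
counts = specific_day.groupby('Direction of maximum wind gust').size()

# Reindex against the full list so missing directions show up as 0
print(counts.reindex(all_dirs, fill_value=0))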

Get the daily percentages of values that fall within certain ranges

I have a large dataset of test results with a column for the date a test was completed and a column for the number of hours it took to complete, i.e.
df = pd.DataFrame({'Completed':['21/03/2020','22/03/2020','21/03/2020','24/03/2020','24/03/2020',], 'Hours_taken':[23,32,8,73,41]})
I have a month's worth of test data, and the tests can take anywhere from a couple of hours to a couple of days. I want to work out, for each day, what percentage of tests fell within the ranges of 24hrs/48hrs/72hrs etc. to complete, up to the percentage of tests that took longer than a week.
I've been able to work it out generally without taking the dates into account like so:
Lab_tests['one-day'] = Lab_tests['hours'].between(0, 24)
Lab_tests['two-day'] = Lab_tests['hours'].between(24, 48)
Lab_tests['GreaterThanWeek'] = Lab_tests['hours'] > 168
one = Lab_tests['one-day'].value_counts().loc[True]
two = Lab_tests['two-day'].value_counts().loc[True]
eight = Lab_tests['GreaterThanWeek'].value_counts().loc[True]
print(one / 10407 * 100)
print(two / 10407 * 100)
print(eight / 10407 * 100)
Ideally I'd like to represent the percentages in another dataset where the rows represent the dates and the columns represent the data ranges. But I can't work out how to take what I've done and modify it to get these percentages for each date. Is this possible to do in pandas?
This question, Counting qualitative values based on the date range in Pandas, is quite similar, but the fact that I'm counting occurrences in specified ranges is throwing me off, and I haven't been able to get a solution out of it.
Bonus Question
I'm sure you've noticed my current code is not the most elegant thing in the world. Is there a cleaner way to do what I've done above? I'm currently repeating it for every range I want.
Edit:
So the Output for the sample data given would look like so:
df = pd.DataFrame({'1-day':[100,0,0,0], '2-day':[0,100,0,50],'3-day':[0,0,0,0],'4-day':[0,0,0,50]},index=['21/03/2020','22/03/2020','23/03/2020','24/03/2020'])
You're almost there. You just need to do a few final steps:
First, cast your bools to ints, so that you can sum them.
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
    Completed  hours  one-day  two-day  GreaterThanWeek
0  21/03/2020     23        1        0                0
1  22/03/2020     32        0        1                0
2  21/03/2020      8        1        0                0
3  24/03/2020     73        0        0                0
4  24/03/2020     41        0        1                0
Then, drop the hours column and roll the rest up to the level of Completed:
Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
            one-day  two-day  GreaterThanWeek
Completed
21/03/2020        2        0                0
22/03/2020        0        1                0
24/03/2020        0        1                0
EDIT: To get to percent, you just need to divide each day's counts by that day's total across all three columns. You can sum across columns by defining the axis of the sum:
...
daily_totals = Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
daily_totals.sum(axis=1)
Completed
21/03/2020 2
22/03/2020 1
24/03/2020 1
dtype: int64
Then divide the daily totals dataframe by these row totals (again, axis defines whether each value of the series is the divisor for a row or a column); multiply by 100 if you want percentages rather than fractions:
daily_totals.div(daily_totals.sum(axis=1), axis=0)
one-day two-day GreaterThanWeek
Completed
21/03/2020 1.0 0.0 0.0
22/03/2020 0.0 1.0 0.0
24/03/2020 0.0 1.0 0.0
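For the bonus question, a sketch of a more compact route: pd.cut can bin the hours into all of the day ranges at once, and pd.crosstab with normalize='index' yields the per-day percentages directly. The bin edges and labels below are assumptions based on the ranges described:

import pandas as pd

df = pd.DataFrame({
    'Completed': ['21/03/2020', '22/03/2020', '21/03/2020', '24/03/2020', '24/03/2020'],
    'Hours_taken': [23, 32, 8, 73, 41],
})

# 24h-wide buckets up to a week, plus a catch-all for anything longer
bins = [0, 24, 48, 72, 96, 120, 144, 168, float('inf')]
labels = ['1-day', '2-day', '3-day', '4-day', '5-day', '6-day', '7-day', '>week']
df['range'] = pd.cut(df['Hours_taken'], bins=bins, labels=labels)

# Rows are dates, columns are ranges, values are percentages per day
print(pd.crosstab(df['Completed'], df['range'], normalize='index') * 100)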

Iteratively fit regression line on subset of pandas DataFrame - vectorized solution?

I am trying to aggregate a pandas DataFrame and create two new columns holding the slope and intercept from a simple linear regression fit.
The dummy dataset looks like this:
CustomerID  Month  Value
a           1      10
a           2      20
a           3      20
b           1      30
b           2      40
c           1      80
c           2      90
And I want the output to look like this - which would regress Value against Month for each CustomerID:
CustomerID  Slope  Intercept
a           0.30   10
b           0.20   30
c           0.12   80
I know I could run a loop and fit the linear regression model for each CustomerID, but my dataset is huge and I need a vectorized approach. I tried using groupby and apply, passing a linear regression function, but didn't find a solution that would work.
Thanks in advance!
Using scipy with groupby; here I use a dict comprehension (a plain loop) rather than apply, since apply is slower than a for loop:
from scipy import stats

pd.DataFrame.from_dict(
    {cid: stats.linregress(g['Month'], g['Value'])[:2] for cid, g in df.groupby('CustomerID')},
    orient='index',
).rename(columns={0: 'Slope', 1: 'Intercept'})
Out[798]:
Slope Intercept
a 5.0 6.666667
b 10.0 20.000000
c 10.0 70.000000
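If a fully vectorized fit is needed, the least-squares coefficients have closed forms (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)) that can be computed with grouped aggregations alone. A sketch under that formulation:

import pandas as pd

df = pd.DataFrame({
    'CustomerID': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'Month': [1, 2, 3, 1, 2, 1, 2],
    'Value': [10, 20, 20, 30, 40, 80, 90],
})

# Per-group means, broadcast back onto each row
mx = df.groupby('CustomerID')['Month'].transform('mean')
my = df.groupby('CustomerID')['Value'].transform('mean')

# slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2), per group
num = ((df['Month'] - mx) * (df['Value'] - my)).groupby(df['CustomerID']).sum()
den = ((df['Month'] - mx) ** 2).groupby(df['CustomerID']).sum()

result = pd.DataFrame({'Slope': num / den})
means = df.groupby('CustomerID')[['Month', 'Value']].mean()
result['Intercept'] = means['Value'] - result['Slope'] * means['Month']
print(result)  # matches the linregress output above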

Pandas dataframe division returning 'inf' when dividing by non-zero

When I try to create a new column in my pandas dataframe by dividing an existing column by another existing column, I am getting 'inf' in rows where there is no division by zero.
claims_report['% COST DIFFERENCE'] = 100*claims_report['COST DIFFERENCE']/claims_data['ORIGINAL UNIT COST']
print(claims_report[['ORIGINAL UNIT COST','COST DIFFERENCE','% COST DIFFERENCE']].head(9))
The result of the above code is:
ORIGINAL UNIT COST COST DIFFERENCE % COST DIFFERENCE
0 4.3732 11.2500 257.248697
1 3.7935 22.0000 579.939370
2 6.9167 22.0000 318.070756
3 1.1429 4.5000 393.735235
4 0.0000 7.3269 inf
5 7.3269 -0.8622 -11.767596
6 6.4647 0.7853 12.147509
7 0.2590 0.0170 6.563707
8 14.4471 -12.7145 -inf
By my calculations, there should not be a -inf in row 8. As a check I ran the following code:
for i in range(9):
    print(i, claims_report['COST DIFFERENCE'][i], claims_report['ORIGINAL UNIT COST'][i],
          claims_report['COST DIFFERENCE'][i] / claims_report['ORIGINAL UNIT COST'][i])
Which gives me the expected result in row 8:
0 11.25 4.3732 2.5724869660660388
1 22.0 3.7935 5.799393699749571
2 22.0 6.9167 3.180707562855119
3 4.5 1.1429 3.937352349286902
4 7.3269 0.0 inf
5 -0.8622 7.3269 -0.11767596118412971
6 0.7853 6.4647 0.1214750877844293
7 0.017 0.259 0.06563706563706564
8 -12.7145 14.4471 -0.880072817382035
Anyone familiar with this type of issue?
In your first line:
claims_report['% COST DIFFERENCE'] = 100*claims_report['COST DIFFERENCE']/claims_data['ORIGINAL UNIT COST']
Didn't you mean "claims_report" instead of "claims_data"? Maybe you're just selecting the wrong dataframe. Division between two different dataframes aligns on the index, so mismatched or reordered indexes can pair a value with the wrong (or a zero) divisor and produce surprises like that -inf.
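A tiny demonstration of that alignment effect on made-up data:

import pandas as pd

a = pd.Series([10.0, 20.0], index=[0, 1])
b = pd.Series([0.0, 2.0], index=[1, 2])

# Division aligns on the index: unmatched labels become NaN, and the
# value at label 1 is divided by the 0.0 from the other series, giving inf
print(a / b)
# 0    NaN
# 1    inf
# 2    NaN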
Another solution in the future may be to do:
import pandas as pd
pd.set_option('use_inf_as_na', True)
which makes pandas treat 'inf' values in your dataframe as 'nan'. Then you can fill them with the fillna method. Note that fillna with inplace=True returns None, so either assign the result or use inplace, but not both:
df.fillna(value=0, inplace=True)
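Alternatively, a small sketch that swaps the infinities out explicitly, without touching a global option:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ratio': [1.5, np.inf, -np.inf, 0.25]})

# Replace +/-inf with NaN, then fill with 0
df = df.replace([np.inf, -np.inf], np.nan).fillna(0)
print(df)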
