Percentile across dataframes, with missing values - python

I have several pandas dataframes (stored in a plain Python list) that look like the following two. Note that there can be (and in fact there are) some missing values at random dates. I need to compute percentiles of TMAX and/or TMAX_ANOM across the several dataframes, for each date, ignoring the missing values.
YYYY MM DD TMAX TMAX_ANOM
0 1980 7 1 13.0 2.333333
1 1980 7 2 14.3 2.566667
2 1980 7 3 15.6 2.800000
3 1980 7 4 16.9 3.033333
4 1980 8 1 18.2 3.266667
5 1980 8 2 19.5 3.500000
6 1980 8 3 20.8 3.733333
7 1980 8 4 22.1 3.966667
8 1981 7 1 10.0 -0.666667
9 1981 7 2 11.0 -0.733333
10 1981 7 3 12.0 -0.800000
11 1981 7 4 13.0 -0.866667
12 1981 8 1 14.0 -0.933333
13 1981 8 2 15.0 -1.000000
14 1981 8 3 16.0 -1.066667
15 1981 8 4 17.0 -1.133333
16 1982 7 1 9.0 -1.666667
17 1982 7 2 9.9 -1.833333
18 1982 7 3 10.8 -2.000000
19 1982 7 4 11.7 -2.166667
20 1982 8 1 12.6 -2.333333
21 1982 8 2 13.5 -2.500000
22 1982 8 3 14.4 -2.666667
23 1982 8 4 15.3 -2.833333
YYYY MM DD TMAX TMAX_ANOM
0 1980 7 1 14.0 3.666667
1 1980 7 2 15.4 4.033333
2 1980 7 3 16.8 4.400000
3 1980 7 4 18.2 4.766667
4 1980 8 1 19.6 5.133333
6 1980 8 3 22.4 5.866667
7 1980 8 4 23.8 6.233333
8 1981 7 1 10.0 -0.333333
9 1981 7 2 11.0 -0.366667
10 1981 7 3 12.0 -0.400000
11 1981 7 4 13.0 -0.433333
12 1981 8 1 14.0 -0.466667
13 1981 8 2 15.0 -0.500000
14 1981 8 3 16.0 -0.533333
15 1981 8 4 17.0 -0.566667
16 1982 7 1 7.0 -3.333333
17 1982 7 2 7.7 -3.666667
18 1982 7 3 8.4 -4.000000
19 1982 7 4 9.1 -4.333333
20 1982 8 1 9.8 -4.666667
21 1982 8 2 10.5 -5.000000
23 1982 8 4 11.9 -5.666667
So just to be clear: in this example with just two dataframes (and supposing the percentile is the median, to simplify the discussion), as output I need a dataframe with 24 elements, with the same YYYY/MM/DD fields, and TMAX (and/or TMAX_ANOM) replaced as follows: for 1980/7/1 it must be the median of 13 and 14, for 1980/7/2 the median of 14.3 and 15.4, and so on. When there are missing values (for example 1980/8/2 in the second dataframe here), the median must be computed just from the remaining dataframes, so in this case the value would simply be 19.5.
I have not been able to find a clean way to accomplish this with either numpy or pandas. Any suggestions, or should I just resort to manual looping?

# use the dates as the index
df1.index = pd.to_datetime(dict(year=df1.YYYY, month=df1.MM, day=df1.DD))
df2.index = pd.to_datetime(dict(year=df2.YYYY, month=df2.MM, day=df2.DD))
# put the useful columns side by side
new_df = df1[['TMAX', 'TMAX_ANOM']].join(df2[['TMAX', 'TMAX_ANOM']], lsuffix='_df1', rsuffix='_df2')
# compute the quantile row-wise; NaN values are skipped automatically
new_df['TMAX_quantile'] = new_df[['TMAX_df1', 'TMAX_df2']].quantile(0.5, axis=1)
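If the list holds more than two dataframes, a more general approach (just a sketch; the list name dfs and the helper name are illustrative, not from the question) is to stack them and group by date, which handles missing rows automatically:

import pandas as pd

def percentile_across(dfs, q=0.5, cols=('TMAX', 'TMAX_ANOM')):
    # stack all dataframes on top of each other
    stacked = pd.concat(dfs, ignore_index=True)
    # group by date; dates missing from some dataframes simply contribute
    # fewer values, and NaN entries are skipped by quantile
    return (stacked.groupby(['YYYY', 'MM', 'DD'])[list(cols)]
                   .quantile(q)
                   .reset_index())

result = percentile_across([df1, df2], q=0.5)  # median, as in the example above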

Related

Take row means of every other column in pandas (python)

I am trying to take row means of every few columns. Here is a sample dataset.
import pandas as pd

d = {'2000-01': range(0, 10), '2000-02': range(10, 20), '2000-03': range(10, 20),
     '2001-01': range(10, 20), '2001-02': range(5, 15), '2001-03': range(5, 15)}
df = pd.DataFrame(data=d)
2000-01 2000-02 2000-03 2001-01 2001-02 2001-03
0 0 10 10 10 5 5
1 1 11 11 11 6 6
2 2 12 12 12 7 7
3 3 13 13 13 8 8
4 4 14 14 14 9 9
5 5 15 15 15 10 10
6 6 16 16 16 11 11
7 7 17 17 17 12 12
8 8 18 18 18 13 13
9 9 19 19 19 14 14
I need to take row means of the first three columns and then the next three and so on in the complete dataset. I don't need the original columns in the new dataset. Here is my code. It works but with caveats (discussed below). I am searching for a cleaner, more elegant solution if possible. (New to Python/Pandas)
import numpy as np

# create an empty list to store the row means
d1 = []
# loop over the columns three at a time and take row means
for i in np.arange(0, 6, 3):
    data1 = df.iloc[:, i:i+3]
    d1.append(data1.mean(axis=1))
# create an empty list to collect the one-column frames
dlist1 = []
# concat the frames
for j in range(0, len(d1)):
    dlist1.append(pd.Series(d1[j]).to_frame())
pd.concat(dlist1, axis=1)
I get this output, which is correct:
0 0
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667
The column names can easily be fixed, but the problem is that I need them in a specific format, and I have 65 of these columns in the actual dataset. If you notice the column names in the original dataset, they are '2000-01', '2000-02', '2000-03'. The 01, 02 and 03 are months of the year 2000, so column 1 of the new df should be '2000q1', q1 being quarter 1. How do I loop over the column names to create this for all my new columns? This seems significantly more challenging (at least to me!) than what is shown here. Thanks for your time!
EDIT: Ok this has been solved, quick shoutout to everyone who contributed!
groupby also works along axis=1; here a numpy array of integer-division labels supplies the grouping key:
df = df.groupby(np.arange(df.shape[1]) // 3, axis=1).mean()
0 1
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667
#np.arange(df.shape[1])//3
#array([0, 0, 0, 1, 1, 1])
A more common way: convert the columns to quarterly periods, then group on them.
df.columns = pd.to_datetime(df.columns, format='%Y-%m').to_period('Q')
df = df.groupby(level=0, axis=1).mean()
2000Q1 2001Q1
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667
Iterate in steps of 3 and concat all the series:
df = pd.concat([df.iloc[:, i:i+3].mean(1).rename(df.columns[i].split('-')[0] + 'q1')
                for i in range(0, df.shape[1], 3)], axis=1)
print(df)
2000q1 2001q1
0 6.666667 6.666667
1 7.666667 7.666667
2 8.666667 8.666667
3 9.666667 9.666667
4 10.666667 10.666667
5 11.666667 11.666667
6 12.666667 12.666667
7 13.666667 13.666667
8 14.666667 14.666667
9 15.666667 15.666667
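One caveat, assuming a recent pandas (2.x) where groupby(..., axis=1) is deprecated: the same column-wise grouping can be done by transposing first. A sketch, applied to the original sample df:

# group columns by quarter without using axis=1
quarters = pd.to_datetime(df.columns, format='%Y-%m').to_period('Q')
out = df.T.groupby(quarters).mean().T
print(out)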

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sale, or NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I am grouping on the unit, keeping the units that have several rows, and then extracting, for these units, the information associated with the minimal date. I then join this table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix, then join to append the new DataFrame to the original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
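If only some columns should be shifted, or the prev_* columns should match the column order of the desired output (price first), the groupby selection can be restricted. A sketch, not part of the original answer:

# shift only the columns of interest, in the order they should appear
prev = df.groupby('unit', sort=False)[['price', 'year', 'month']].shift(-1).add_prefix('prev_')
df = df.join(prev)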

Pandas Dataframe: shift/merge multiple rows sharing the same column values into one row

Sorry for any possible confusion with the title. I will describe my question better with the following code and tables.
I have a dataframe with multiple columns, sorted by the first two columns, 'Route' and 'ID' (sorry about the formatting; all the rows here have a 'Route' value of 100 and 'ID' values from 1 to 3).
df1.head(9)
Route ID Year Vol Truck_Vol Truck_%
0 100 1 2017.0 7016 635.0 9.1
1 100 1 2014.0 6835 NaN NaN
2 100 1 2011.0 5959 352.0 5.9
3 100 2 2018.0 15828 NaN NaN
4 100 2 2015.0 13114 2964.0 22.6
5 100 2 2009.0 11844 1280.0 10.8
6 100 3 2016.0 15434 NaN NaN
7 100 3 2013.0 18699 2015.0 10.8
8 100 3 2010.0 15903 NaN NaN
What I want to have is
Route ID Year Vol1 Truck_Vol1 Truck_%1 Year2 Vol2 Truck_Vol2 Truck_%2 Year3 Vol3 Truck_Vol3 Truck_%3
0 100 1 2017 7016 635.0 9.1 2014 6835 NaN NaN 2011 5959 352.0 5.9
1 100 2 2018 15828 NaN NaN 2015 13114 2964.0 22.6 2009 11844 1280.0 10.8
2 100 3 2016 15434 NaN NaN 2013 18699 2015.0 10.8 2010 15903 NaN NaN
Again, sorry for the messy formatting. Let me try a simplified version.
Input:
Route ID Year Vol T_%
0 100 1 2017 100 1.0
1 100 1 2014 200 NaN
2 100 1 2011 300 2.0
3 100 2 2018 400 NaN
4 100 2 2015 500 3.0
5 100 2 2009 600 4.0
Desired Output:
Route ID Year Vol T_% Year.1 Vol.1 T_%.1 Year.2 Vol.2 T_%.2
0 100 1 2017 100 1.0 2014 200 NaN 2011 300 2
1 100 2 2018 400 NaN 2015 500 3.0 2009 600 4
So basically the values from the later rows of each group should be moved up alongside the first row as new columns.
I am stumped here. The names for the newly generated columns don't matter.
For this current dataframe, I have three rows per 'group' as shown in the code. It would be great if the answer could accommodate any number of rows per group.
Thanks for your time.
with groupby + cumcount + set_index + unstack
df1 = (df.assign(cid=df.groupby(['Route', 'ID']).cumcount())
         .set_index(['Route', 'ID', 'cid'])
         .unstack(-1)
         .sort_index(axis=1, level=1))
df1.columns = [f'{x}{y}' for x, y in df1.columns]
df1 = df1.reset_index()
Output df1:
Route ID T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
0 100 1 1.0 100 2017 NaN 200 2014 2.0 300 2011
1 100 2 NaN 400 2018 3.0 500 2015 4.0 600 2009
melt + pivot_table
v = df.melt(id_vars=['Route', 'ID'])
v['variable'] += v.groupby(['Route', 'ID', 'variable']).cumcount().astype(str)
res = v.pivot_table(index=['Route', 'ID'], columns='variable', values='value')
variable T_%0 T_%1 T_%2 Vol0 Vol1 Vol2 Year0 Year1 Year2
Route ID
100 1 1.0 NaN 2.0 100.0 200.0 300.0 2017.0 2014.0 2011.0
2 NaN 3.0 4.0 400.0 500.0 600.0 2018.0 2015.0 2009.0
If you want to sort these:
c = res.columns.str.extract(r'(\d+)')[0].values.astype(int)
res.iloc[:,np.argsort(c)]
variable T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
Route ID
100 1 1.0 100.0 2017.0 NaN 200.0 2014.0 2.0 300.0 2011.0
2 NaN 400.0 2018.0 3.0 500.0 2015.0 4.0 600.0 2009.0
You asked why I used cumcount. To explain, here is what v looks like right after the melt, before the cumcount step:
Route ID variable value
0 100 1 Year 2017.0
1 100 1 Year 2014.0
2 100 1 Year 2011.0
3 100 2 Year 2018.0
4 100 2 Year 2015.0
5 100 2 Year 2009.0
6 100 1 Vol 100.0
7 100 1 Vol 200.0
8 100 1 Vol 300.0
9 100 2 Vol 400.0
10 100 2 Vol 500.0
11 100 2 Vol 600.0
12 100 1 T_% 1.0
13 100 1 T_% NaN
14 100 1 T_% 2.0
15 100 2 T_% NaN
16 100 2 T_% 3.0
17 100 2 T_% 4.0
If I used pivot_table on this DataFrame, you would end up with something like this:
variable T_% Vol Year
Route ID
100 1 1.5 200.0 2014.0
2 3.5 500.0 2014.0
Obviously you are losing data here. cumcount is the solution, as it turns the variable series into this:
Route ID variable value
0 100 1 Year0 2017.0
1 100 1 Year1 2014.0
2 100 1 Year2 2011.0
3 100 2 Year0 2018.0
4 100 2 Year1 2015.0
5 100 2 Year2 2009.0
6 100 1 Vol0 100.0
7 100 1 Vol1 200.0
8 100 1 Vol2 300.0
9 100 2 Vol0 400.0
10 100 2 Vol1 500.0
11 100 2 Vol2 600.0
12 100 1 T_%0 1.0
13 100 1 T_%1 NaN
14 100 1 T_%2 2.0
15 100 2 T_%0 NaN
16 100 2 T_%1 3.0
17 100 2 T_%2 4.0
Where you have a count of repeated elements per unique Route and ID.
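A small variant of the same idea (a sketch, not from the original answers): once cumcount has made every (Route, ID, variable) combination unique, the reshape can be done with set_index + unstack instead of pivot_table, which avoids any aggregation step:

v = df.melt(id_vars=['Route', 'ID'])
v['variable'] += v.groupby(['Route', 'ID', 'variable']).cumcount().astype(str)
# each (Route, ID, variable) key is now unique, so unstack reshapes without aggregating
res = v.set_index(['Route', 'ID', 'variable'])['value'].unstack()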

Inserting Index along with column values from one dataframe to another

If I have two data frames df1 and df2:
df1
yr
24 1984
30 1985
df2
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
I would like to have a dataframe that gives the output shown below. Could you suggest a function that would help me achieve this?
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
24 NaN NaN 1984
30 NaN NaN 1985
You are looking to concat two dataframes:
res = pd.concat([df2, df1], sort=False)
print(res)
d m yr
16 12.0 4.0 2012
17 13.0 10.0 1976
18 24.0 4.0 98
24 NaN NaN 1984
30 NaN NaN 1985
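As a side note (an addition, not part of the original answer): the NaN values introduced by concat force 'd' and 'm' to float. If integer display is preferred, a recent pandas lets you cast those columns to the nullable Int64 dtype, for example:

res = pd.concat([df2, df1], sort=False)
res[['d', 'm']] = res[['d', 'm']].astype('Int64')  # nullable integers keep missing values as <NA>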

Rolling Mean with Groupby object in Pandas returns null

I have this initial DataFrame in Pandas
A B C D E
0 23 2015 1 14937 16.25
1 23 2015 1 19054 7.50
2 23 2015 2 14937 16.75
3 23 2015 2 19054 17.25
4 23 2015 3 14937 71.75
5 23 2015 3 19054 15.00
6 23 2015 4 14937 13.00
7 23 2015 4 19054 37.75
8 23 2015 5 14937 4.25
9 23 2015 5 19054 18.25
10 23 2015 6 14937 16.50
11 23 2015 6 19054 1.00
I create a GroupBy object because I would like to obtain a rolling mean grouped by columns A, B, C, D:
DfGby = Df.groupby(['A','B', 'C','D'])
Then I compute the rolling mean:
DfMean = pd.DataFrame(DfGby.rolling(center=False,window=3)['E'].mean())
But I obtain
E
A B C D
23 2015 1 14937 0 NaN
19054 1 NaN
2 14937 2 NaN
19054 3 NaN
3 14937 4 NaN
19054 5 NaN
4 14937 6 NaN
19054 7 NaN
5 14937 8 NaN
19054 9 NaN
6 14937 10 NaN
19054 11 NaN
What is the problem here?
If I want to obtain this result, how could I do it?
A B C D E
0 23 2015 1 14937 NaN
1 23 2015 2 14937 NaN
2 23 2015 2 14937 16.6
3 23 2015 1 14937 35.1
4 23 2015 2 14937 33.8
5 23 2015 3 14937 29.7
6 23 2015 4 14937 11.3
7 23 2015 4 19054 NaN
8 23 2015 5 19054 NaN
9 23 2015 5 19054 13.3
10 23 2015 6 19054 23.3
11 23 2015 6 19054 23.7
12 23 2015 6 19054 19.0
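No answer is attached here, but the likely problem is that grouping by all four columns A, B, C and D makes every group a single row, so a window of 3 can never be filled and every rolling mean is NaN. Judging from the desired output, the rolling mean should instead be taken within each (A, B, D) group along C. A minimal sketch under that assumption:

# assumes Df is the frame from the question and rows are already ordered by C
# within each (A, B, D) group, as in the sample data
Df['E_roll'] = (Df.groupby(['A', 'B', 'D'])['E']
                  .transform(lambda s: s.rolling(window=3, center=False).mean()))
print(Df.sort_values(['A', 'B', 'D', 'C']))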
