I have a pandas DataFrame containing several years of time-series data as columns. Each series starts in November and ends in the subsequent year. I'm trying to deal with NaNs in non-leap years. The structure can be recreated with something like this:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# one season of daily "Mon-DD" labels: Nov-01 through Mar-30, Feb-29 included
ndays = 151
sdates = [(datetime(2019,11,1) + timedelta(days=x)).strftime("%b-%d") for x in range(ndays)]
columns = list(range(2016, 2021))
df = pd.DataFrame(np.random.randint(0, 100, size=(ndays, len(columns))), index=sdates, columns=columns)
# 2017-2019 are non-leap years, so their Feb-29 entries are NaN
df.loc["Feb-29", 2017:2019] = np.nan
df.loc["Feb-28":"Mar-01"]
Out[61]:
        2016  2017  2018  2019  2020
Feb-28    36  59.0  74.0  19.0    24
Feb-29    85   NaN   NaN   NaN     6
Mar-01    24  75.0  49.0  99.0    82
What I want to do is remove the "Feb-29" NaN data only (in the non-leap years) and then shift the data in those columns up a row, leaving the leap years as-is. Something like this, with Mar-01 and subsequent rows shifted up for 2017 through 2019:
        2016  2017  2018  2019  2020
Feb-28    36  59.0  74.0  19.0    24
Feb-29    85  75.0  49.0  99.0     6
Mar-01    24  42.0  21.0  41.0    82
I don't care that "Mar-01" data will be labelled as "Feb-29", as eventually I'll be replacing the string date index with an integer index.
Note that I didn't include this in the example, but I have NaNs at the start and end of the dataframe in varying rows that I do not want to remove (i.e., I can't just drop all NaN data; I need to target "Feb-29" specifically).
It sounds like you don't actually want to shift dates up, but rather number them correctly based on the day of the year? If so, this will work:
First, make the DataFrame long instead of wide:
df = pd.DataFrame(
    {
        "2016": {"Feb-28": 36, "Feb-29": 85, "Mar-01": 24},
        "2017": {"Feb-28": 59.0, "Feb-29": None, "Mar-01": 75.0},
        "2018": {"Feb-28": 74.0, "Feb-29": None, "Mar-01": 49.0},
        "2019": {"Feb-28": 19.0, "Feb-29": None, "Mar-01": 99.0},
        "2020": {"Feb-28": 24, "Feb-29": 6, "Mar-01": 82},
    }
)
df = (
    df.melt(ignore_index=False, var_name="year", value_name="value")
    .reset_index()
    .rename(columns={"index": "month-day"})
)
df
   month-day  year  value
0     Feb-28  2016   36.0
1     Feb-29  2016   85.0
2     Mar-01  2016   24.0
3     Feb-28  2017   59.0
4     Feb-29  2017    NaN
5     Mar-01  2017   75.0
6     Feb-28  2018   74.0
7     Feb-29  2018    NaN
8     Mar-01  2018   49.0
9     Feb-28  2019   19.0
10    Feb-29  2019    NaN
11    Mar-01  2019   99.0
12    Feb-28  2020   24.0
13    Feb-29  2020    6.0
14    Mar-01  2020   82.0
Then remove the rows containing an invalid date and get the day of the year for the remaining rows:
df["date"] = pd.to_datetime(
df.apply(lambda row: " ".join(row[["year", "month-day"]]), axis=1), errors="coerce",
)
df = df[df["date"].notna()]
df["day_of_year"] = df["date"].dt.dayofyear
df
   month-day  year  value       date  day_of_year
0     Feb-28  2016   36.0 2016-02-28           59
1     Feb-29  2016   85.0 2016-02-29           60
2     Mar-01  2016   24.0 2016-03-01           61
3     Feb-28  2017   59.0 2017-02-28           59
5     Mar-01  2017   75.0 2017-03-01           60
6     Feb-28  2018   74.0 2018-02-28           59
8     Mar-01  2018   49.0 2018-03-01           60
9     Feb-28  2019   19.0 2019-02-28           59
11    Mar-01  2019   99.0 2019-03-01           60
12    Feb-28  2020   24.0 2020-02-28           59
13    Feb-29  2020    6.0 2020-02-29           60
14    Mar-01  2020   82.0 2020-03-01           61
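Since you said you'll eventually replace the string date index with an integer index anyway, one way to finish (a sketch building on the frame above; season_day is a name I'm introducing) is to number the rows within each year by date order and pivot back to wide. Non-leap years have no Feb-29 row left, so their Mar-01 values land on the same row number as the leap years' Feb-29, which is exactly the shift-up you described:
# Number the rows within each year by date order, then pivot back to wide
df["season_day"] = df.sort_values("date").groupby("year").cumcount()
wide = df.pivot(index="season_day", columns="year", values="value")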
I think this would do the trick?
# Columns with data on Feb-29 are the leap years
is_leap = ~np.isnan(df.loc["Feb-29"])
leap = df.loc[:, is_leap]
nonleap = df.loc[:, ~is_leap]
# Drop the NaN row from the non-leap columns, then re-join; resetting the
# index shifts their data up one row from "Feb-29" onwards (they come out
# one row shorter, so they get a trailing NaN after the concat)
nonleap = nonleap.drop(index="Feb-29")
df2 = pd.concat([
    leap.reset_index(drop=True),
    nonleap.reset_index(drop=True)
], axis=1).sort_index(axis=1)
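If you want the date labels back for display while you're still using them, a tiny follow-up: the result has the same number of rows as the original, so you can reattach the original index (the labels will be the leap-year ones, as in your example):
# Reattach the original (leap-year) date labels
df2.index = df.index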
This should move all the nulls to the end by sorting each column on pd.isnull, which I believe is what you want. Python's sort is stable, so the non-null values keep their original order:
df = df.apply(lambda x: sorted(x, key=pd.isnull), axis=0)
Before:
        2016  2017  2018  2019  2020
Feb-28    35  85.0  88.0  46.0    19
Feb-29    41   NaN   NaN   NaN    52
Mar-01    86  92.0  29.0  44.0    36
After:
        2016  2017  2018  2019  2020
Feb-28    35  85.0  88.0  46.0    19
Feb-29    41  92.0  29.0  44.0    52
Mar-01    86  32.0  50.0  27.0    36
I am trying to reshape the following dataframe into panel-data form by moving the "Award Year" column so that each year becomes an individual column.
Out[34]:
         Award Year    0
State
Alabama        2003   89
Alabama        2004   92
Alabama        2005  108
Alabama        2006   81
Alabama        2007   71
...             ...  ...
Wyoming        2011    4
Wyoming        2012    2
Wyoming        2013    1
Wyoming        2014    4
Wyoming        2015    3

[648 rows x 2 columns]
I want each year to be an individual column. Here is an example:
Out[48]:
        State  2003  2004  2005  2006
0     NewYork    10    10    10    10
1     Alabama    15    15    15    15
2  Washington    20    20    20    20
I have read up on stack/unstack, but I don't think I want a multi-level index as a result. I have been looking through the documentation (to_frame etc.) but I can't see what I am looking for.
If anyone can help, that would be great!
Use set_index with append=True, then select the '0' column and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year  2003  2004   2005  2006  2007  2011  2012  2013  2014  2015
State
Alabama     89.0  92.0  108.0  81.0  71.0   NaN   NaN   NaN   NaN   NaN
Wyoming      NaN   NaN    NaN   NaN   NaN   4.0   2.0   1.0   4.0   3.0
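If you also want State back as an ordinary column with a flat 0..n index, as in your example output, a small follow-up sketch (rename_axis(columns=None) just removes the leftover "Award Year" label from the columns axis):
# Optional cleanup: drop the columns-axis name and turn State into a column
df = df.rename_axis(columns=None).reset_index()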
pivot_table can also help. By default it aggregates duplicate State/year pairs with the mean, which is harmless here since each pair appears once:
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'])
df2
Result:
Award Year  2003  2004   2005  2006  2007  2011  2012  2013  2014  2015
State
Alabama     89.0  92.0  108.0  81.0  71.0   NaN   NaN   NaN   NaN   NaN
Wyoming      NaN   NaN    NaN   NaN   NaN   4.0   2.0   1.0   4.0   3.0
I have a dataframe with information about sales of some products (unit):
   unit  year  month  price
0     1  2018      6    100
1     1  2013      4     70
2     2  2015     10     80
3     2  2015      2    110
4     3  2017      4    120
5     3  2002      6     90
6     4  2016      1     55
and I would like to add, for each sale, columns with information about the previous sale of the same unit, with NaN if there is no previous sale:
   unit  year  month  price  prev_price  prev_year  prev_month
0     1  2018      6    100        70.0     2013.0         4.0
1     1  2013      4     70         NaN        NaN         NaN
2     2  2015     10     80       110.0     2015.0         2.0
3     2  2015      2    110         NaN        NaN         NaN
4     3  2017      4    120        90.0     2002.0         6.0
5     3  2002      6     90         NaN        NaN         NaN
6     4  2016      1     55         NaN        NaN         NaN
For the moment I group on unit, keep the units that have several rows, and extract the information associated with the minimal date for those units. Then I join that table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix, then join to append the new DataFrame to the original. Each unit's rows run newest-first here, so shift(-1) pulls the next (earlier) sale's values up onto the current row:
# if the real data are not sorted newest-first within each unit:
# df = df.sort_values(['unit', 'year', 'month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print(df)
   unit  year  month  price  prev_year  prev_month  prev_price
0     1  2018      6    100     2013.0         4.0        70.0
1     1  2013      4     70        NaN         NaN         NaN
2     2  2015     10     80     2015.0         2.0       110.0
3     2  2015      2    110        NaN         NaN         NaN
4     3  2017      4    120     2002.0         6.0        90.0
5     3  2002      6     90        NaN         NaN         NaN
6     4  2016      1     55        NaN         NaN         NaN
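Note that shift(-1) is right here only because each unit's rows run newest-first. If your data is instead sorted oldest-first, the previous sale is the row above, so the sign flips (same idea, ascending order assumed):
# previous sale = the row above within each unit when sorted ascending by date
df = df.sort_values(['unit', 'year', 'month'])
df = df.join(df.groupby('unit', sort=False).shift(1).add_prefix('prev_'))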
If I have two data frames df1 and df2:
df1
      yr
24  1984
30  1985
df2
     d   m    yr
16  12   4  2012
17  13  10  1976
18  24   4    98
I would like to have a dataframe that gives the output below. Could you help with a function to achieve this?
      d    m    yr
16   12    4  2012
17   13   10  1976
18   24    4    98
24  NaN  NaN  1984
30  NaN  NaN  1985
You are looking to concatenate the two dataframes with pd.concat:
res = pd.concat([df2, df1], sort=False)
print(res)
       d     m    yr
16  12.0   4.0  2012
17  13.0  10.0  1976
18  24.0   4.0    98
24   NaN   NaN  1984
30   NaN   NaN  1985
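As a side note, d and m get upcast to float so they can hold NaN. If you'd rather keep them as integers, one option (a sketch, needs a reasonably recent pandas) is the nullable integer dtype:
# Keep d and m integer-typed despite the missing values (shown as <NA>)
res = pd.concat([df2, df1], sort=False).astype({'d': 'Int64', 'm': 'Int64'})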
I have this initial DataFrame in pandas:
     A     B  C      D      E
0   23  2015  1  14937  16.25
1   23  2015  1  19054   7.50
2   23  2015  2  14937  16.75
3   23  2015  2  19054  17.25
4   23  2015  3  14937  71.75
5   23  2015  3  19054  15.00
6   23  2015  4  14937  13.00
7   23  2015  4  19054  37.75
8   23  2015  5  14937   4.25
9   23  2015  5  19054  18.25
10  23  2015  6  14937  16.50
11  23  2015  6  19054   1.00
I create a GroupBy object because I would like to obtain a rolling mean grouped by columns A, B, C, D:
DfGby = Df.groupby(['A', 'B', 'C', 'D'])
Then I compute the rolling mean:
DfMean = pd.DataFrame(DfGby.rolling(center=False, window=3)['E'].mean())
But I obtain:
                      E
A  B    C D
23 2015 1 14937 0   NaN
          19054 1   NaN
        2 14937 2   NaN
          19054 3   NaN
        3 14937 4   NaN
          19054 5   NaN
        4 14937 6   NaN
          19054 7   NaN
        5 14937 8   NaN
          19054 9   NaN
        6 14937 10  NaN
          19054 11  NaN
What is the problem here?
If I want to obtain this result, how could I do it?
     A     B  C      D     E
0   23  2015  1  14937   NaN
1   23  2015  2  14937   NaN
2   23  2015  2  14937  16.6
3   23  2015  1  14937  35.1
4   23  2015  2  14937  33.8
5   23  2015  3  14937  29.7
6   23  2015  4  14937  11.3
7   23  2015  4  19054   NaN
8   23  2015  5  19054   NaN
9   23  2015  5  19054  13.3
10  23  2015  6  19054  23.3
11  23  2015  6  19054  23.7
12  23  2015  6  19054  19.0
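The problem is the grouping: with all of A, B, C, and D as keys, every group in this data contains exactly one row, so a rolling window of 3 can never fill and every mean is NaN. The desired output suggests rolling within each product D across the C periods, i.e. grouping by fewer keys. A sketch of that idea (E_mean is a name I'm introducing; it assumes rows are already ordered by C within each group, as they are above):
# Drop C from the group keys so each group is a whole series, roll over E,
# then strip the group levels so the result realigns with Df's row index
DfGby = Df.groupby(['A', 'B', 'D'])
Df['E_mean'] = (
    DfGby['E']
    .rolling(window=3, center=False)
    .mean()
    .reset_index(level=['A', 'B', 'D'], drop=True)
)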