I have a DataFrame:
pd.DataFrame({"date": ["2018-12-21", "2018-12-22", "2018-05-04"], "price":[100,np.nan, 105]})
Out:
date price
2018-12-21 100.0
2018-12-22 NaN
2018-05-04 105.0
I'm trying to .fillna() by taking the value of Price, of the day before. So in this case, the NaN value will be filled with 100, because we took the date of the NaN value minus one day.
Use:
df = pd.DataFrame({"date": ["2018-12-21", "2018-12-22",
"2018-05-04","2018-05-05",
"2018-05-06","2018-05-09"],
"price":[100,np.nan, 105, np.nan, 108, np.nan]})
print (df)
date price
0 2018-12-21 100.0
1 2018-12-22 NaN
2 2018-05-04 105.0
3 2018-05-05 NaN
4 2018-05-06 108.0
5 2018-05-09 NaN
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df['price'] = df['price'].combine_first(df['price'].shift(1, freq='d'))
#alternative
#df['price'] = df['price'].combine_first(df['price'].shift(1, freq='d'))
print (df)
price
date
2018-12-21 100.0
2018-12-22 100.0
2018-05-04 105.0
2018-05-05 105.0
2018-05-06 108.0
2018-05-09 NaN
If need repalce last non missing value (not day before):
df['price'] = df['price'].ffill()
print (df)
date price
0 2018-12-21 100.0
1 2018-12-22 100.0
2 2018-05-04 105.0
3 2018-05-05 105.0
4 2018-05-06 108.0
5 2018-05-09 108.0
Related
I have a df and I want to stick the values of it. At first I want to select the specific time, and replace the Nan values with the same in the day before. Here is a simple example: I only want to choose the values in 2020, I want to stick its value based on the time, and also replace the nan value same as day before.
df = pd.DataFrame()
df['day'] =[ '2020-01-01', '2019-01-01', '2020-01-02','2020-01-03', '2018-01-01', '2020-01-15','2020-03-01', '2020-02-01', '2017-01-01' ]
df['value_1'] = [ 1, np.nan, 32, 48, 5, -1, 5,10,2]
df['value_2'] = [ np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the follwing code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know what the output is supposed to be but i think this should do at least part of what you're trying to do
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
I have a large dataset which contains a date column that covers from the year 2019. Now I do want to generate number of weeks on a separate column that are contained in those dates.
Here is how the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Now as from the first day this data was collected, I want to count 7 days using the date column and create a week out it. an example if the first week contains the 7 dates, I create a column and call it week one. I want to do the same process until the last week the data was collected.
Maybe it will be a good idea to organize the dates in order as from the first date to current one.
I have tried this but its not generating weeks in order, it actually has repetitive weeks.
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is, as from the first date the date was collected, count 7 days and store that as week one then continue incrementally until the last week say week number 66.
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
based on this data frame I wanted the MARK from the previous year, I managed to acquire the maximum COTA but I wanted the last one, I used .max() and I thought I could get it with .last() but it didn't work.
follow the example of my code.
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['Found', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['Found', 'LastYear'])
print (df)
Found Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year
This is my data:
df = pd.DataFrame([
{start_date: '2019/12/01', end_date: '2019/12/05', spend: 10000, campaign_id: 1}
{start_date: '2019/12/05', end_date: '2019/12/09', spend: 50000, campaign_id: 2}
{start_date: '2019/12/01', end_date: '', spend: 10000, campaign_id: 3}
{start_date: '2019/12/01', end_date: '2019/12/01', spend: 50, campaign_id: 4}
]);
I need to add a column to each row for each day since 2019/12/01, and calculate the spend on that campaign that day, which I'll get by dividing the spend on the campaign by the total number of days it was active.
So here I'd add a column for each day between 1 December and today (10 December). For row 1, the content of the five columns for 1 Dec to 5 Dec would be 2000, then for the six ocolumns from 5 Dec to 10 Dec it would be zero.
I know pandas is well-designed for this kind of problem, but I have no idea where to start!
Doesn't seem like a straight forward task to me. But first convert your date columns if you haven't already:
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
Then create a helper function for resampling:
def resampler(data, daterange):
temp = (data.set_index('start_date').groupby('campaign_id')
.apply(daterange)
.drop("campaign_id",axis=1)
.reset_index().rename(columns={"level_1":"start_date"}))
return temp
Now its a 3 step process. First resample your data according to end_date of each group:
df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
With the average values calculated, resample again to current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally transpose your dataframe:
df1 = pd.concat([df1.drop("end_date",axis=1).set_index(["campaign_id","start_date"]).unstack(),
df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates,"end_date"]
print (df1)
#
2019-12-01 00:00:00 2019-12-02 00:00:00 2019-12-03 00:00:00 2019-12-04 00:00:00 2019-12-05 00:00:00 2019-12-06 00:00:00 2019-12-07 00:00:00 2019-12-08 00:00:00 2019-12-09 00:00:00 2019-12-10 00:00:00 end_date
campaign_id
1 2000.0 2000.0 2000.0 2000.0 2000.0 NaN NaN NaN NaN NaN 2019-12-05
2 NaN NaN NaN NaN 10000.0 10000.0 10000.0 10000.0 10000.0 NaN 2019-12-09
3 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019-12-01
I am trying to fill in the NaN's after I upsample my timeseries with resample's pad() function.
I used the resample('1min').asfreq to upsample from hourly data to minute-interval data, then used resample.('1min').pad() it does not fill in the NaN values with the previous value as it should in this Pandas.Dataframe.resample tutorial.
Run to create dataframe with datetime index
url = "https://www.ndbc.noaa.gov/view_text_file.php?filename=42887h2016.txt.gz&dir=data/historical/stdmet/"
data_csv = urlopen(url)
df = pd.read_csv(data_csv, delim_whitespace=True, index_col=0, parse_dates=True)
df.drop(['WDIR', 'WSPD', 'GST', 'WVHT', 'DPD', 'APD', 'MWD', 'PRES', 'VIS', 'TIDE', 'VIS', 'ATMP', 'WTMP'],
axis = 1, inplace = True)
#Data Preparation
df.reset_index(level=0, inplace=True)
df = df.iloc[1:]
df = df.rename(columns={'#YY': 'YY'})
#Create datetime variable
df['Date'] = df[df.columns[0:3]].apply(lambda x: '/'.join(x.dropna().astype(int).astype(str)),axis=1)
df['Time'] = df[df.columns[3:5]].apply(lambda x: ':'.join(x.dropna().astype(int).astype(str)),axis=1)
df['Date.Time'] = df['Date'] + ':' + df['Time']
df['Date'] = pd.to_datetime(df['Date'], format = '%Y/%m/%d')
df['Date.Time'] = pd.to_datetime(df['Date.Time'], format='%Y/%m/%d:%H:%M', utc=True)
#Remaining data prep for the dataframe and create index w/ time date
df = df.convert_objects(convert_numeric=True)
df = df[(df['MM'] == 2.0) | (df['MM'] == 3.0)]
df = df.replace(999, np.nan)
df = df.set_index('Date.Time')
df.drop(['hh', 'mm', 'Time', 'Date'], axis = 1, inplace = True)
The result is the dataframe we want:
YY MM DD DEWP
Date.Time
2016-12-01 00:00:00+00:00 2016 12 1 11.3
2016-12-01 01:00:00+00:00 2016 12 1 9.0
2016-12-01 02:00:00+00:00 2016 12 1 11.0
2016-12-01 03:00:00+00:00 2016 12 1 10.8
2016-12-01 04:00:00+00:00 2016 12 1 6.5
Now resample up to 1 min from an hour
df = df.resample('1min').asfreq()
df.head()
Results:
YY MM DD DEWP
Date.Time
2016-12-01 00:00:00+00:00 2016.0 12.0 1.0 11.3
2016-12-01 00:01:00+00:00 NaN NaN NaN NaN
2016-12-01 00:02:00+00:00 NaN NaN NaN NaN
2016-12-01 00:03:00+00:00 NaN NaN NaN NaN
2016-12-01 00:04:00+00:00 NaN NaN NaN NaN
Fill in NaN values with Pad command
df = df.resample('1min').pad()
df.head()
Results:
YY MM DD DEWP
Date.Time
2016-12-01 00:00:00+00:00 2016.0 12.0 1.0 11.3
2016-12-01 00:01:00+00:00 NaN NaN NaN NaN
2016-12-01 00:02:00+00:00 NaN NaN NaN NaN
2016-12-01 00:03:00+00:00 NaN NaN NaN NaN
2016-12-01 00:04:00+00:00 NaN NaN NaN NaN
Variable DEWP is supposed to look like this
YY MM DD DEWP
Date.Time
2016-12-01 00:00:00+00:00 2016.0 12.0 1.0 11.3
2016-12-01 00:01:00+00:00 2016.0 12.0 1.0 11.3
2016-12-01 00:02:00+00:00 2016.0 12.0 1.0 11.3
2016-12-01 00:03:00+00:00 2016.0 12.0 1.0 11.3
2016-12-01 00:04:00+00:00 2016.0 12.0 1.0 11.3
Any help would be appreciated!
The function df.resample('1min').fillna("pad") worked. Documentation can be found here.