Hi all, I'm a newbie to Python and am stuck on the problem below. I have a DF like this:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2013-01-02 93
5 2017-02-01 43
For the yearly gaps, i.e. 2012, 2014, 2015, and 2016, I'd like to fill in each missing year with the New Year's date and the port_id carried over from the previous year. Ideally, I'd like:
ipdb> DF
asofdate port_id
1 2010-01-01 76
2 2010-04-01 43
3 2011-02-01 76
4 2012-01-01 76
5 2013-01-02 93
6 2014-01-01 93
7 2015-01-01 93
8 2016-01-01 93
9 2017-02-01 43
I've tried multiple approaches, but to no avail. Could someone shed some light on how to make this work? Thanks much in advance!
You can use set.difference with range to find the missing years and then concatenate a dataframe of their New Year's dates:
# convert to datetime if not already converted
df['asofdate'] = pd.to_datetime(df['asofdate'])
# calculate missing years
years = df['asofdate'].dt.year
missing = set(range(years.min(), years.max() + 1)) - set(years)
# concatenate the missing dates, sort and forward-fill
# (df.append was removed in pandas 2.0; pd.concat is the replacement)
df = pd.concat([df, pd.DataFrame({'asofdate': pd.to_datetime([str(y) for y in missing], format='%Y')})])\
    .sort_values('asofdate')\
    .ffill()
print(df)
asofdate port_id
1 2010-01-01 76.0
2 2010-04-01 43.0
3 2011-02-01 76.0
1 2012-01-01 76.0
4 2013-01-02 93.0
2 2014-01-01 93.0
3 2015-01-01 93.0
0 2016-01-01 93.0
5 2017-02-01 43.0
I would create a helper dataframe containing all the year-start dates, then filter out the ones whose years already appear in df, and finally merge them together:
# First make sure it is proper datetime
df['asofdate'] = pd.to_datetime(df.asofdate)
# Create your temporary dataframe of year start dates
helper = pd.DataFrame({'asofdate':pd.date_range(df.asofdate.min(), df.asofdate.max(), freq='YS')})
# Filter out the rows where the year is already in df
helper = helper[~helper.asofdate.dt.year.isin(df.asofdate.dt.year)]
# Merge back in to df, sort, and forward fill
new_df = df.merge(helper, how='outer').sort_values('asofdate').ffill()
>>> new_df
asofdate port_id
0 2010-01-01 76.0
1 2010-04-01 43.0
2 2011-02-01 76.0
5 2012-01-01 76.0
3 2013-01-02 93.0
6 2014-01-01 93.0
7 2015-01-01 93.0
8 2016-01-01 93.0
4 2017-02-01 43.0
I have generation data for two generators in 15-minute time blocks, and I want to convert it to hourly. Here is an example:
Time Gen1 Gen2
00:15:00 10 21
00:30:00 12 22
00:45:00 16 26
01:00:00 20 11
01:15:00 60 51
01:30:00 30 31
01:45:00 70 21
02:00:00 40 61
I want to take the average of each set of four values (i.e., aggregate the 15-minute blocks into an hourly block) and put it in place of the 1-hour block. Expected output:
Time Gen1 Gen2
01:00:00 14.5 20
02:00:00 50 41
I know I can use pandas' groupby function to get the expected output, but I don't know the proper syntax. Can anyone please help?
Use resample with closed='right'. But first, convert your Time column to datetime:
df['Time'] = pd.to_datetime(df['Time'])
df.resample('H', on='Time', closed='right').mean().reset_index()
Time Gen1 Gen2
0 2021-01-09 00:00:00 14.5 20.0
1 2021-01-09 01:00:00 50.0 41.0
To convert the Time column back to time format, use:
df['Time'] = df['Time'].dt.time
Time Gen1 Gen2
0 00:00:00 14.5 20.0
1 01:00:00 50.0 41.0
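As a side note, the buckets above are labeled by their left edge (00:00, 01:00); if you want them stamped with the closing hour (01:00, 02:00) as in the expected output, add label='right' as well. A minimal runnable sketch with the sample data hard-coded (using "60min" to sidestep the 'H' vs 'h' alias change across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime(["00:15:00", "00:30:00", "00:45:00", "01:00:00",
                            "01:15:00", "01:30:00", "01:45:00", "02:00:00"]),
    "Gen1": [10, 12, 16, 20, 60, 30, 70, 40],
    "Gen2": [21, 22, 26, 11, 51, 31, 21, 61],
})

# closed='right' puts 00:15..01:00 into one bucket; label='right' stamps
# that bucket with its closing hour, matching the expected output
out = df.resample("60min", on="Time", closed="right", label="right").mean()
print(out)
```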
You can try creating a column hour and then groupby('hour').mean(). Note that this groups by wall-clock hour (so the 01:00:00 row falls into the 01 group), which differs from the closed='right' output above:
df['date_time'] = pd.to_datetime(df['Time'], format="%H:%M:%S")
df['hour'] = df['date_time'].apply(lambda x: x.strftime("%H:00:00"))
gr_df = df.groupby('hour').mean()
gr_df.index.name = 'Time'
print(gr_df.reset_index())
Time Gen1 Gen2
0 00:00:00 12.666667 23.0
1 01:00:00 45.000000 28.5
2 02:00:00 40.000000 61.0
If I have a dataframe like this:
timestamp price
1596267946298 100.0
1596267946299 101.0
1596267946300 102.0
1596267948301 99.0
1596267948302 98.0
1596267949303 99.0
and I want to create the high, low and average during resampling:
I can duplicate the price column into a high and low column and then during resample do the min, max and mean on the appropriate columns.
But I was wondering if there is a way to make this in one pass?
My expected output would be (let's assume resampling at 100ms for this example):
timestamp price min mean max
1596267946298 100.0 100 100.5 101
1596267946299 101.0 100 100.5 101
1596267946300 102.0 98 99.5 102
1596267948301 99.0 98 99.5 102
1596267948302 98.0 98 99.5 102
1596267949303 99.0 98 99.5 102
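One single-pass route to the broadcast-back shape shown above (a sketch, not the only option) is to group by the floored timestamp and use transform, which computes each bucket's statistics once and repeats them on every row. Note the expected table above appears to use wider buckets than a strict 100ms floor; the mechanics are the same either way:

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [100.0, 101.0, 102.0, 99.0, 98.0, 99.0]},
    index=pd.to_datetime([1596267946298, 1596267946299, 1596267946300,
                          1596267948301, 1596267948302, 1596267949303],
                         unit="ms"),
)

# floor each timestamp to its 100ms bucket, then broadcast the bucket
# statistics back onto the original rows with transform
g = df["price"].groupby(df.index.floor("100ms"))
for stat in ("min", "mean", "max"):
    df[stat] = g.transform(stat)
print(df)
```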
You could do something like this:
import pandas as pd
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)
def custom_func(x):
    # use iloc for positional access; x[-1] on a Series is deprecated
    return x.iloc[-1], x.min(), x.max(), x.mean()
result = series.resample('3T').apply(custom_func)
print(pd.DataFrame(result.tolist(), columns=['resampled', 'min', 'max', 'mean'], index=result.index))
Before resampling
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
After resampling
resampled min max mean
2000-01-01 00:00:00 2 0 2 1.0
2000-01-01 00:03:00 5 3 5 4.0
2000-01-01 00:06:00 8 6 8 7.0
I want to extract the year from a datetime column into a new 'yyyy' column, AND I want the missing values (NaT) to be displayed as NaN. I guess the datetime dtype of the new column has to change, but that's where I'm stuck.
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np
# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})
df.ID = pd.to_numeric(df.ID)
df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')
# Tries to set NaT to NaN or datetime to numeric, PROBLEM: empty cells keep 'NaT'
# try 1
df.loc[df['yyyy'].isna(), 'yyyy'] = np.nan
# try 2
df.yyyy = df.Date.astype(float)
# try 3
df.yyyy = pd.to_numeric(df.Date)
print(df)
Use Series.dt.year and convert to nullable integers with Int64:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
Without converting the floats to integers:
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution converts NaT to the string 'NaT', so it is possible to use replace.
By the way, in recent versions of pandas, replace is not necessary; it works correctly without it.
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
Isn't it:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
I have a pandas DataFrame of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am interested in only the year, month and day in the birth column of the dataframe. I tried to convert it with pandas' to_datetime, but it resulted in an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth column is an object dtype.
My guess would be that it is an incorrect date. I would rather not pass errors="coerce" into the to_datetime method, because each item is important and I need just the YYYY-MM-DD.
I also tried to use a regex via pandas:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this is returning NaNs. How can I resolve this?
Thanks
Because conversion to datetimes is not possible here, you can split by the first whitespace and then select the first value:
df['birth'] = df['birth'].str.split().str[0]
And then, if necessary, convert to periods, which can represent out-of-bounds dates:
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
    splitted = x.split('-')
    return pd.Period(year=int(splitted[0]),
                     month=int(splitted[1]),
                     day=int(splitted[2]), freq='D')
df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
I have a MultiIndex DataFrame with gappy date values on level 1, like this:
import random
import numpy as np
import pandas as pd
random.seed(456)
np.random.seed(456)
j = [(a, b) for a in ['A','B','C'] for b in random.sample(pd.date_range('2018-01-01', periods=100, freq='D').tolist(), 5)]
j.sort()
i = pd.MultiIndex.from_tuples(j, names=['Name','Date'])
df = pd.DataFrame(np.random.randint(0, 101, 15), i, columns=['Vals'])  # random_integers was removed from NumPy
# print(df):
Vals
Name Date
A 2018-01-01 27
2018-01-08 43
2018-03-26 89
2018-03-29 42
2018-04-01 28
B 2018-01-02 79
2018-01-26 60
2018-02-18 45
2018-03-11 37
2018-03-23 92
C 2018-03-17 39
2018-03-20 81
2018-03-21 11
2018-03-27 77
2018-04-08 69
For each level 0 value, I want to fill in the index level 1 with every calendar date between the min and max date values for that level 0. (This Q&A addresses the scenario of filling in level 1 with the same value set for all level 0 values.)
E.g., for subset = df.loc['A'] I want to insert rows so that subset.index.values == pd.date_range(subset.index.values.min(), subset.index.values.max()).values. I.e., the resulting DataFrame would look like:
Vals
Name Date
A 2018-01-01 27
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-06 NaN
2018-01-07 NaN
2018-01-08 43
2018-01-09 NaN
...
Is there an idiomatic pandas way to accomplish this?
(The best I can come up with is to inefficiently and iteratively append a new DataFrame for each level 0 value, or similarly to iteratively build a list of index values and pandas.concat them with the original DataFrame.)
You can use asfreq:
df.groupby(level=0).apply(lambda x: x.reset_index(level=0, drop=True).asfreq("D"))
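For example, on a small hand-made frame (values assumed for illustration), asfreq('D') inserts every calendar day between each group's min and max dates, and groupby re-attaches the Name level as the outer index:

```python
import pandas as pd

i = pd.MultiIndex.from_tuples(
    [("A", pd.Timestamp("2018-01-01")), ("A", pd.Timestamp("2018-01-04")),
     ("B", pd.Timestamp("2018-02-01")), ("B", pd.Timestamp("2018-02-03"))],
    names=["Name", "Date"],
)
df = pd.DataFrame({"Vals": [1, 2, 3, 4]}, index=i)

# per group: drop the Name level, reindex to daily frequency (gaps get NaN)
filled = df.groupby(level=0).apply(
    lambda x: x.reset_index(level=0, drop=True).asfreq("D")
)
print(filled)
```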