I have a pandas dataframe with columns of year-month data (yyyymm). I am planning to interpolate the data to daily and weekly values. Here is my df below.
df:
201301 201302 201303 ... 201709 201710
a 0.747711 0.793101 0.771819 ... 0.818161 0.812522
b 0.776537 0.759745 0.733673 ... 0.757496 0.765181
c 0.801699 0.847655 0.796586 ... 0.784537 0.763551
d 0.797942 0.687899 0.729911 ... 0.819887 0.772395
e 0.777472 0.799676 0.782947 ... 0.804533 0.791759
f 0.780933 0.750774 0.781056 ... 0.790846 0.773705
g 2.071699 2.261739 2.126915 ... 1.891780 2.098914
As you can see, my df has monthly columns, and I am hoping to change this to daily values. I am planning on using linear interpolation. Here is an example.
# (201302 - 201301)/31 (since January 2013 has 31 days)
a = (0.793101-0.747711)/31
# now a is the daily increase (or decrease, depending on the values) for one day.
# 2013-01-01 value would be
0.747711
# 2013-01-02 value would be
0.747711 + a
# 2013-01-03 value would be
0.747711 + (a*2)
# last day of January would be
0.747711 + (a*30)
# first day of Feb would be
0.747711 + (a*31)  # which is 0.793101 (the 201302 value)
So my df_daily would have every day from 2013 through the first day of October 2017, with values just like the above. I am very weak at working with timestamps, so it would be great if there is any way to interpolate my values from monthly to daily. Thanks!
Oh please let me know if my question is confusing...
First convert the columns to datetimes with to_datetime, then reindex to create NaN columns for the missing days, and finally interpolate:
df.columns = pd.to_datetime(df.columns, format='%Y%m')
#by first and last values of columns
rng = pd.date_range(df.columns[0], df.columns[-1])
#alternatively min and max of columns
#rng = pd.date_range(df.columns.min(), df.columns.max())
df = df.reindex(rng, axis=1).interpolate(axis=1)
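The question also asks for weekly values. A minimal sketch, assuming the weeks should be anchored on the first date of the daily grid: interpolate to daily exactly as above, then keep every seventh column:
#pick every 7th day out of the daily result
wk = pd.date_range(df.columns[0], df.columns[-1], freq='7D')
df_weekly = df.loc[:, wk]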
Verify the solution:
a = (0.793101-0.747711)/31
print (0.747711 + a)
print (0.747711 + a*2)
print (0.747711 + a*3)
0.7491751935483871
0.7506393870967742
0.7521035806451613
print (df)
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06 \
a 0.747711 0.749175 0.750639 0.752104 0.753568 0.755032
b 0.776537 0.775995 0.775454 0.774912 0.774370 0.773829
c 0.801699 0.803181 0.804664 0.806146 0.807629 0.809111
d 0.797942 0.794392 0.790842 0.787293 0.783743 0.780193
e 0.777472 0.778188 0.778905 0.779621 0.780337 0.781053
f 0.780933 0.779960 0.778987 0.778014 0.777042 0.776069
g 2.071699 2.077829 2.083960 2.090090 2.096220 2.102351
2013-01-07 2013-01-08 2013-01-09 2013-01-10 ... 2017-09-22 \
a 0.756496 0.757960 0.759425 0.760889 ... 0.814214
b 0.773287 0.772745 0.772204 0.771662 ... 0.762876
c 0.810594 0.812076 0.813559 0.815041 ... 0.769847
d 0.776643 0.773094 0.769544 0.765994 ... 0.786643
e 0.781770 0.782486 0.783202 0.783918 ... 0.795591
f 0.775096 0.774123 0.773150 0.772177 ... 0.778847
g 2.108481 2.114611 2.120742 2.126872 ... 2.036774
2017-09-23 2017-09-24 2017-09-25 2017-09-26 2017-09-27 2017-09-28 \
a 0.814026 0.813838 0.813650 0.813462 0.813274 0.813086
b 0.763132 0.763388 0.763644 0.763900 0.764156 0.764413
c 0.769147 0.768448 0.767748 0.767049 0.766349 0.765650
d 0.785060 0.783476 0.781893 0.780310 0.778727 0.777144
e 0.795165 0.794740 0.794314 0.793888 0.793462 0.793036
f 0.778276 0.777705 0.777133 0.776562 0.775990 0.775419
g 2.043678 2.050583 2.057487 2.064392 2.071296 2.078201
2017-09-29 2017-09-30 2017-10-01
a 0.812898 0.812710 0.812522
b 0.764669 0.764925 0.765181
c 0.764950 0.764251 0.763551
d 0.775561 0.773978 0.772395
e 0.792611 0.792185 0.791759
f 0.774848 0.774276 0.773705
g 2.085105 2.092010 2.098914
[7 rows x 1735 columns]
Having trouble finding a solution for my problem.
Suppose I have a df:
date_rng = pd.date_range('2020-01-01', '2020-12-31', freq='D')
df = pd.DataFrame({'col': np.random.randn(len(date_rng)), 'created_at': date_rng})
df
output is:
col created_at
0 1.764052 2020-01-01
1 0.400157 2020-01-02
2 0.978738 2020-01-03
3 2.240893 2020-01-04
4 1.867558 2020-01-05
... ... ...
361 0.003771 2020-12-27
362 0.931848 2020-12-28
363 0.339965 2020-12-29
364 -0.015682 2020-12-30
365 0.160928 2020-12-31
The problem is that I want to filter the dataframe to the last six months, back to the first of the month. For example, if today is October 23, 2020, I want the dataframe to return results from April 1st onward.
When run in November, the first date should be May 1st, regardless of the day in November.
Any ideas of how to do it?
This is supposed to run automatically, so something like:
df = df[(df.created_at.dt.month >= datetime.datetime.utcnow().month)
        & (df.created_at.dt.year == datetime.datetime.utcnow().year)]
won't work.
Thanks!!!
You can use between to specify a condition between two values:
from datetime import datetime, timedelta

today = datetime.today()
target = today - timedelta(days=180)
df = df[df['created_at'].between(datetime(target.year, target.month, 1), today)]
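Note that timedelta(days=180) only approximates six calendar months. A hedged alternative sketch using pandas offset arithmetic, which lands exactly on the first of the month six months back (April 1st in the question's example):
start = pd.Timestamp.today().normalize().replace(day=1) - pd.DateOffset(months=6)
df = df[df['created_at'] >= start]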
I've got a pandas dataframe, organized by date, that I'm trying to split up by year (using a column called 'year'). I want to return one dataframe per year, with a name something like "df19XX".
I was hoping to write a "For" loop that can handle this... something like...
for d in [1980, 1981, 1982]:
df(d) = df[df['year']==d]
... which would return three data frames called df1980, df1981 and df1982.
thanks!
Something like this? Also using @Andy's df:
variables = locals()
for i in [2012, 2013]:
    variables["df{0}".format(i)] = df.loc[df.date.dt.year == i]
df2012
Out[118]:
A date
0 0.881468 2012-12-28
1 0.237672 2012-12-29
2 0.992287 2012-12-30
3 0.194288 2012-12-31
df2013
Out[119]:
A date
4 0.151854 2013-01-01
5 0.855312 2013-01-02
6 0.534075 2013-01-03
You can iterate through the groupby:
In [11]: df = pd.DataFrame({"date": pd.date_range("2012-12-28", "2013-01-03"), "A": np.random.rand(7)})
In [12]: df
Out[12]:
A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03
In [13]: g = df.groupby(df.date.dt.year)
In [14]: for k, v in g:
    ...:     print(k)
    ...:     print(v)
    ...:     print()
    ...:
2012
A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31
2013
A date
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03
I would strongly argue that it is preferable to simply use a dict rather than creating variables by messing around with the locals() dictionary (I claim that using locals() like that is not "pythonic"):
In [14]: {k: grp for k, grp in g}
Out[14]:
{2012: A date
0 0.434715 2012-12-28
1 0.208877 2012-12-29
2 0.912897 2012-12-30
3 0.226368 2012-12-31, 2013: A date
4 0.100489 2013-01-01
5 0.474088 2013-01-02
6 0.348368 2013-01-03}
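For example, bind the comprehension to a name and look it up by year (dfs is just an illustrative name):
dfs = {k: grp for k, grp in g}
dfs[2012]  #same frame as df2012 above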
Though you might consider calculating this on the fly (rather than storing in a dict or indeed a variable). You can use get_group:
In [15]: g.get_group(2012)
Out[15]:
A date
0 0.865239 2012-12-28
1 0.019071 2012-12-29
2 0.362088 2012-12-30
3 0.031861 2012-12-31
I have this DataFrame:
dft2 = pd.DataFrame(np.random.randn(20, 1),
columns=['A'],
index=pd.MultiIndex.from_product([pd.date_range('20130101',
periods=10,
freq='4M'),
['a', 'b']]))
That looks like this when I print it.
Output:
A
2013-01-31 a 0.275921
b 1.336497
2013-05-31 a 1.040245
b 0.716865
2013-09-30 a -2.697420
b -1.570267
2014-01-31 a 1.326194
b -0.209718
2014-05-31 a -1.030777
b 0.401654
2014-09-30 a 1.138958
b -1.162370
2015-01-31 a 1.770279
b 0.606219
2015-05-31 a -0.819126
b -0.967827
2015-09-30 a -1.423667
b 0.894103
2016-01-31 a 1.765187
b -0.334844
How do I filter to the rows whose date is the minimum of each year, like 2013-01-31 and 2014-01-31?
Thanks.
# Create a dataframe from the dates in the first level of the index.
df = pd.DataFrame(dft2.index.get_level_values(0), columns=['date'], index=dft2.index)
# Add a `year` column holding the year of each date.
df = df.assign(year=[d.year for d in df['date']])
# Find the minimum date of each year by grouping.
min_annual_dates = df.groupby('year')['date'].min().tolist()
# Filter the original dataframe on these minimum dates.
dft2.loc[(min_annual_dates, slice(None)), :]
A
2013-01-31 a 1.087274
b 1.488553
2014-01-31 a 0.119801
b 0.922468
2015-01-31 a -0.262440
b 0.642201
2016-01-31 a 1.144664
b 0.410701
Or you can try using isin:
dft1 = dft2.reset_index()
dft1['Year'] = dft1.level_0.dt.year
dft1 = dft1.groupby('Year')['level_0'].min()
dft2[dft2.index.get_level_values(0).isin(dft1.values)]
Out[2250]:
A
2013-01-31 a -1.072400
b 0.660115
2014-01-31 a -0.134245
b 1.344941
2015-01-31 a 0.176067
b -1.792567
2016-01-31 a 0.033230
b -0.960175
Suppose I have a time series like so:
pd.Series(np.random.rand(20), index=pd.date_range("1990-01-01",periods=20))
1990-01-01 0.018363
1990-01-02 0.288625
1990-01-03 0.460708
1990-01-04 0.663063
1990-01-05 0.434250
1990-01-06 0.504893
1990-01-07 0.587743
1990-01-08 0.412223
1990-01-09 0.604656
1990-01-10 0.960338
1990-01-11 0.606765
1990-01-12 0.110480
1990-01-13 0.671683
1990-01-14 0.178488
1990-01-15 0.458074
1990-01-16 0.219303
1990-01-17 0.172665
1990-01-18 0.429534
1990-01-19 0.505891
1990-01-20 0.242567
Freq: D, dtype: float64
Suppose the event dates are 1990-01-05 and 1990-01-15. I want to subset the data down to a window of (-2, +2) days around each event, with an added column giving the relative number of days from the event date (which has value 0):
1990-01-03 0.460708 -2
1990-01-04 0.663063 -1
1990-01-05 0.434250 0
1990-01-06 0.504893 1
1990-01-07 0.587743 2
1990-01-13 0.671683 -2
1990-01-14 0.178488 -1
1990-01-15 0.458074 0
1990-01-16 0.219303 1
1990-01-17 0.172665 2
Freq: D, dtype: float64
This question is related to my previous question here: Event Study in Pandas
Leveraging the solution from your previous question 'Event Study in Pandas' by @jezrael:
import numpy as np
import pandas as pd
s = pd.Series(np.random.rand(20), index=pd.date_range("1990-01-01",periods=20))
date1 = pd.to_datetime('1990-01-05')
date2 = pd.to_datetime('1990-01-15')
window = 2
dates = [date1, date2]
s1 = pd.concat([s.loc[date - pd.Timedelta(window, unit='d'):
                      date + pd.Timedelta(window, unit='d')] for date in dates])
Convert to dataframe:
df = s1.to_frame()
df['Offset'] = pd.Series(data=np.arange(-window, window + 1).tolist() * len(dates), index=s1.index)
df
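The arange trick above assumes every window is complete (exactly 2*window+1 rows per event). A minimal sketch, not from the original answer, that derives each row's offset from its own event date instead, so it also works when a window is clipped at the edges of the series:
parts = []
for date in dates:
    w = s.loc[date - pd.Timedelta(window, unit='d'):
              date + pd.Timedelta(window, unit='d')].to_frame()
    w['Offset'] = (w.index - date).days  #days relative to the event
    parts.append(w)
df = pd.concat(parts)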
I have a csv file that I am trying to import into pandas.
There are two columns of interest, date and hour, and they are the first two columns.
E.g.
date,hour,...
10-1-2013,0,
10-1-2013,0,
10-1-2013,0,
10-1-2013,1,
10-1-2013,1,
How do I import using pandas so that hour and date are combined, or is that best done after the initial import?
df = DataFrame.from_csv('bingads.csv', sep=',')
If I do the initial import, how do I combine the two into a date and then delete the hour?
Thanks
Define your own date_parser:
In [291]: from dateutil.parser import parse
In [292]: import datetime as dt
In [293]: def date_parser(x):
     ...:     date, hour = x.split(' ')
     ...:     return parse(date) + dt.timedelta(0, 3600 * int(hour))
In [298]: pd.read_csv('test.csv', parse_dates=[[0,1]], date_parser=date_parser)
Out[298]:
date_hour a b c
0 2013-10-01 00:00:00 1 1 1
1 2013-10-01 00:00:00 2 2 2
2 2013-10-01 00:00:00 3 3 3
3 2013-10-01 01:00:00 4 4 4
4 2013-10-01 01:00:00 5 5 5
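Note that the date_parser argument was deprecated in pandas 2.0. A sketch of the same combination done after a plain read, assuming the sample's m-d-Y dates and integer hours:
df = pd.read_csv('test.csv')
df['date_hour'] = (pd.to_datetime(df['date'], format='%m-%d-%Y')
                   + pd.to_timedelta(df['hour'], unit='h'))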
Using read_clipboard here to grab the sample; apply read_csv instead to handle your actual data:
>>> df = pd.read_clipboard(sep=',')
>>> df['date'] = pd.to_datetime(df.date) + pd.to_timedelta(df.hour, unit='D')/24
>>> del df['hour']
>>> df
date ...
0 2013-10-01 00:00:00 NaN
1 2013-10-01 00:00:00 NaN
2 2013-10-01 00:00:00 NaN
3 2013-10-01 01:00:00 NaN
4 2013-10-01 01:00:00 NaN
[5 rows x 2 columns]
Take a look at the parse_dates argument that pandas.read_csv accepts.
You can do something like:
df = pandas.read_csv('some.csv', parse_dates=True)
# in which case pandas will parse all columns where it finds dates
df = pandas.read_csv('some.csv', parse_dates=[i,j,k])
# in which case pandas will parse the i, j and kth columns for dates
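parse_dates can also merge several columns into one by passing a nested list; pandas joins the named columns with a space and parses the result as a single datetime. Whether the joined 'date hour' string parses cleanly depends on the parser, which is why a custom date_parser as shown above may still be needed:
df = pandas.read_csv('some.csv', parse_dates=[['date', 'hour']])
# pandas creates a combined 'date_hour' column from the two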
Since you are only using the two columns from the csv file and combining them into one, I would squeeze them into a series of datetime objects like so:
import pandas as pd
from io import StringIO
import datetime as dt
txt='''\
date,hour,A,B
10-1-2013,0,1,6
10-1-2013,0,2,7
10-1-2013,0,3,8
10-1-2013,1,4,9
10-1-2013,1,5,10'''
def date_parser(date, hour):
    dates = []
    for ed, eh in zip(date, hour):
        month, day, year = list(map(int, ed.split('-')))
        dates.append(dt.datetime(year, month, day, int(eh)))
    return dates

p = pd.read_csv(StringIO(txt), usecols=[0, 1],
                parse_dates=[[0, 1]], date_parser=date_parser, squeeze=True)
print(p)
Prints:
0 2013-10-01 00:00:00
1 2013-10-01 00:00:00
2 2013-10-01 00:00:00
3 2013-10-01 01:00:00
4 2013-10-01 01:00:00
Name: date_hour, dtype: datetime64[ns]