I have this DataFrame:
import numpy as np
import pandas as pd

dft2 = pd.DataFrame(np.random.randn(20, 1),
                    columns=['A'],
                    index=pd.MultiIndex.from_product(
                        [pd.date_range('20130101', periods=10, freq='4M'),
                         ['a', 'b']]))
That looks like this when I print it.
Output:
A
2013-01-31 a 0.275921
b 1.336497
2013-05-31 a 1.040245
b 0.716865
2013-09-30 a -2.697420
b -1.570267
2014-01-31 a 1.326194
b -0.209718
2014-05-31 a -1.030777
b 0.401654
2014-09-30 a 1.138958
b -1.162370
2015-01-31 a 1.770279
b 0.606219
2015-05-31 a -0.819126
b -0.967827
2015-09-30 a -1.423667
b 0.894103
2016-01-31 a 1.765187
b -0.334844
How do I filter for the rows whose date is the minimum of its year, like 2013-01-31 and 2014-01-31?
Thanks.
# Create dataframe from the dates in the first level of the index.
df = pd.DataFrame(dft2.index.get_level_values(0), columns=['date'], index=dft2.index)
# Add a `year` column that gets the year of each date.
df = df.assign(year=[d.year for d in df['date']])
# Find the minimum date of each year by grouping.
min_annual_dates = df.groupby('year')['date'].min().tolist()
# Filter the original dataframe based on these minimum dates by year.
>>> dft2.loc[(min_annual_dates, slice(None)), :]
A
2013-01-31 a 1.087274
b 1.488553
2014-01-31 a 0.119801
b 0.922468
2015-01-31 a -0.262440
b 0.642201
2016-01-31 a 1.144664
b 0.410701
Or you can try using isin:
dft1=dft2.reset_index()
dft1['Year']=dft1.level_0.dt.year
dft1=dft1.groupby('Year')['level_0'].min()
dft2[dft2.index.get_level_values(0).isin(dft1.values)]
Out[2250]:
A
2013-01-31 a -1.072400
b 0.660115
2014-01-31 a -0.134245
b 1.344941
2015-01-31 a 0.176067
b -1.792567
2016-01-31 a 0.033230
b -0.960175
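A more compact variant of the same idea, not taken from either answer above, is to build a boolean mask with groupby/transform; just a sketch:
# for each row, find the minimum date within its year, then keep the rows that match it
dates = dft2.index.get_level_values(0)
min_per_year = pd.Series(dates).groupby(dates.year).transform('min')
dft2[dates == min_per_year.values]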
I have data that is in this inconvenient format. Simple reproducible example below:
26/9/21 26/9/21
10:00 Paul
12:00 John
27/9/21 27/9/21
1:00 Ringo
As you can see, the dates have not been entered as a column. Instead, the dates repeat across rows as a "header" row for the rows below it. Each date then has a variable number of data rows beneath it, before the next date "header" row.
The output I would like would be:
26/9/21 10:00 Paul
26/9/21 12:00 John
27/9/21 1:00 Ringo
How can I do this in Python and Pandas?
Code for data entry below:
import pandas as pd
df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
df
Convert column a to datetime with errors='coerce', then forward-fill the dates. Now you can add the time offsets from the rows that held times.
sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
msk = sra.isnull()
sra = sra.ffill() + pd.to_timedelta(df.loc[msk, 'a'] + ':00')
out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
>>> out
a b
1 2021-09-26 10:00:00 Paul
2 2021-09-26 12:00:00 John
4 2021-09-27 01:00:00 Ringo
Step by step:
>>> sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
0 2021-09-26
1 NaT
2 NaT
3 2021-09-27
4 NaT
Name: a, dtype: datetime64[ns]
>>> msk = sra.isnull()
0 False
1 True
2 True
3 False
4 True
Name: a, dtype: bool
>>> sra = sra.ffill() + pd.to_timedelta(df.loc[msk, 'a'] + ':00')
0 NaT
1 2021-09-26 10:00:00
2 2021-09-26 12:00:00
3 NaT
4 2021-09-27 01:00:00
Name: a, dtype: datetime64[ns]
>>> out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
a b
1 2021-09-26 10:00:00 Paul
2 2021-09-26 12:00:00 John
4 2021-09-27 01:00:00 Ringo
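If you want separate date and time columns, as in the desired output at the top (that target format, and the column names used here, are assumptions), the combined datetime can be split back out:
out = out.rename(columns={'a': 'datetime', 'b': 'name'})
out['date'] = out['datetime'].dt.strftime('%d/%m/%y')   # e.g. '26/09/21'
out['time'] = out['datetime'].dt.strftime('%H:%M')      # e.g. '10:00'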
The following is simple-to-understand code that reads the original dataframe row by row and builds a new dataframe:
df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
                   'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
dflen = len(df)
newrow = []; newdata = []
for i in range(dflen):               # read each row one by one
    if '/' in df.iloc[i, 0]:         # if a date is found
        item0 = df.iloc[i, 0]        # get the new date
        newrow = [item0]             # put the date as the first entry of a new row
        continue                     # go to the next row
    newrow.append(df.iloc[i, 0])     # add the time
    newrow.append(df.iloc[i, 1])     # add the name
    newdata.append(newrow)           # add the completed row to the new data
    newrow = [item0]                 # start a new row with the same date entry
newdf = pd.DataFrame(newdata, columns=['Date', 'Time', 'Name'])  # create the new dataframe
print(newdf)
Output:
Date Time Name
0 26/9/21 10:00 Paul
1 26/9/21 12:00 John
2 27/9/21 1:00 Ringo
I need to perform a merge to map a new set of ids to an old set of ids. My starting data looks like this:
lst = [10001, 20001, 30001]
dt = pd.date_range(start='2016', end='2018', freq='M')
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
In [94]: df.head()
Out[94]:
0
date id
2016-01-31 10001 -0.512371
20001 -1.164461
30001 -1.253232
2016-02-29 10001 -0.129874
20001 0.711938
And I want to map id to newid using data that looks like this:
df1 = pd.DataFrame({'id': [10001, 10001, 10001, 10001],
'start_date': ['2015-11-30', '2016-02-01', '2016-05-16', '2017-02-16'],
'end_date': ['2016-01-31', '2016-05-15', '2017-02-15', '2018-04-02'],
'new_id': ['ABC123', 'XYZ789', 'HIJ456', 'LMN654']},)
df2 = pd.DataFrame({'id': [20001, 20001, 20001, 20001],
'start_date': ['2015-10-07', '2016-01-08', '2016-06-02', '2017-02-13'],
'end_date': ['2016-01-07', '2016-06-01', '2017-02-12', '2018-03-17'],
'new_id': ['CBA321', 'ZYX987', 'JIH765', 'NML345']},)
df3 = pd.DataFrame({'id': [30001, 30001, 30001, 30001],
'start_date': ['2015-07-31', '2016-02-23', '2016-06-17', '2017-05-12'],
'end_date': ['2016-02-22', '2016-06-16', '2017-05-11', '2018-01-05'],
'new_id': ['CCC333', 'XXX444', 'HHH888', 'III888']},)
df_ranges = pd.concat([df1,df2,df3])
In [95]: df_ranges.head()
Out[95]:
index end_date id new_id start_date
0 0 2016-01-31 10001 ABC123 2015-11-30
1 1 2016-05-15 10001 XYZ789 2016-02-01
2 2 2017-02-15 10001 HIJ456 2016-05-16
3 3 2018-04-02 10001 LMN654 2017-02-16
4 0 2016-01-07 20001 CBA321 2015-10-07
Basically, my data is monthly panel data and the new data has ranges of dates for which a specific mapping from A->B is valid. So row 1 of the mapping data says that from 2015-11-30 through 2016-01-31 the id 10001 maps to ABC123.
I've previously done this in SAS/SQL with a statement like this:
SELECT a.*, b.newid FROM df as a, df_ranges as b
WHERE a.id = b.id AND b.start_date <= a.date < b.end_date
A few notes about the data:
It should be a 1:1 mapping of id to new_id.
The date ranges are non-overlapping.
The solution here may be a good start: Merging dataframes based on date range
It is exactly what I'm looking for except that it merges only on dates, not additionally on id. I played with groupby() and this solution but didn't find a way to make it work. Another idea I had was to unstack() the mapping data (df_ranges) to match the dimensions/time frequency of df but this seems to simply re-state the existing problem.
Perhaps I got downvoted because this was too easy, but I couldn't find the answer anywhere so I'll just post it here: you should use merge_asof(), which provides fuzzy (as-of) matching on dates.
First, the data need to be sorted:
df_ranges.sort_values(by=['start_date','id'],inplace=True)
df.sort_values(by=['date','id'],inplace=True)
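Note: the merge below assumes 'date' and 'id' are ordinary columns of df and that the date columns of df_ranges are real datetimes; with the construction shown in the question they are index levels and strings respectively, so something like the following preparation (a sketch, not part of the original answer) may be needed first:
df = df.reset_index()                                    # turn 'date' and 'id' into columns
df_ranges['start_date'] = pd.to_datetime(df_ranges['start_date'])
df_ranges['end_date'] = pd.to_datetime(df_ranges['end_date'])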
Then, do the merge:
pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date')
Output:
In [30]: pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date').head()
Out[30]:
date id 0 start_date end_date new_id
0 2016-01-31 10001 0.120892 2015-11-30 2016-01-31 ABC123
1 2016-01-31 20001 -0.576096 2016-01-08 2016-06-01 ZYX987
2 2016-01-31 30001 0.543597 2015-07-31 2016-02-22 CCC333
3 2016-02-29 10001 0.316212 2016-02-01 2016-05-15 XYZ789
4 2016-02-29 20001 -0.625878 2016-01-08 2016-06-01 ZYX987
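One caveat the answer doesn't cover: merge_asof only enforces start_date <= date, while the SQL in the question also bounds the match by end_date. If there can be gaps between an id's ranges, a follow-up filter is one way to enforce that; a sketch (the SQL uses a strict <, but the expected ABC123 match on 2016-01-31 requires <=, so choose the boundary you actually intend):
out = pd.merge_asof(df, df_ranges, by='id', left_on='date', right_on='start_date')
out = out[out['date'] <= out['end_date']]   # drop rows whose date falls past the range's end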
I have a pandas dataframe with columns of year-month data (yyyymm). I am planning on interpolating the data to daily & weekly values. Here is my df below.
df:
201301 201302 201303 ... 201709 201710
a 0.747711 0.793101 0.771819 ... 0.818161 0.812522
b 0.776537 0.759745 0.733673 ... 0.757496 0.765181
c 0.801699 0.847655 0.796586 ... 0.784537 0.763551
d 0.797942 0.687899 0.729911 ... 0.819887 0.772395
e 0.777472 0.799676 0.782947 ... 0.804533 0.791759
f 0.780933 0.750774 0.781056 ... 0.790846 0.773705
g 2.071699 2.261739 2.126915 ... 1.891780 2.098914
As you can see, my df has monthly columns and I am hoping to change this to daily values. I am planning on using linear interpolation. Here is an example.
# (value for 201302 - value for 201301) / 31 (since January 2013 has 31 days)
a = (0.793101-0.747711)/31
# now a is the daily increase (or decrease, depending on the values) per day.
# the 2013-01-01 value would be
0.747711
# the 2013-01-02 value would be
0.747711 + a
# the 2013-01-03 value would be
0.747711 + (a*2)
# the last day of January would be
0.747711 + (a*30)
# the first day of Feb would be
0.747711 + (a*31)  # which is 0.793101 (the 201302 value)
So my df_daily would have every day from 2013 through the first day of Oct 2017, and the values would be just like above. I am very weak at working with timestamps, so it would be great if there is any way to interpolate my values from monthly to daily. Thanks!
Oh please let me know if my question is confusing...
First convert the columns to datetimes with to_datetime, then reindex to create NaN columns for the missing days, and finally interpolate:
df.columns = pd.to_datetime(df.columns, format='%Y%m')
#by first and last values of columns
rng = pd.date_range(df.columns[0], df.columns[-1])
#alternatively min and max of columns
#rng = pd.date_range(df.columns.min(), df.columns.max())
df = df.reindex(rng, axis=1).interpolate(axis=1)
Verify solution:
a = (0.793101-0.747711)/31
print (0.747711 + a)
print (0.747711 + a*2)
print (0.747711 + a*3)
0.7491751935483871
0.7506393870967742
0.7521035806451613
print (df)
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06 \
a 0.747711 0.749175 0.750639 0.752104 0.753568 0.755032
b 0.776537 0.775995 0.775454 0.774912 0.774370 0.773829
c 0.801699 0.803181 0.804664 0.806146 0.807629 0.809111
d 0.797942 0.794392 0.790842 0.787293 0.783743 0.780193
e 0.777472 0.778188 0.778905 0.779621 0.780337 0.781053
f 0.780933 0.779960 0.778987 0.778014 0.777042 0.776069
g 2.071699 2.077829 2.083960 2.090090 2.096220 2.102351
2013-01-07 2013-01-08 2013-01-09 2013-01-10 ... 2017-09-22 \
a 0.756496 0.757960 0.759425 0.760889 ... 0.814214
b 0.773287 0.772745 0.772204 0.771662 ... 0.762876
c 0.810594 0.812076 0.813559 0.815041 ... 0.769847
d 0.776643 0.773094 0.769544 0.765994 ... 0.786643
e 0.781770 0.782486 0.783202 0.783918 ... 0.795591
f 0.775096 0.774123 0.773150 0.772177 ... 0.778847
g 2.108481 2.114611 2.120742 2.126872 ... 2.036774
2017-09-23 2017-09-24 2017-09-25 2017-09-26 2017-09-27 2017-09-28 \
a 0.814026 0.813838 0.813650 0.813462 0.813274 0.813086
b 0.763132 0.763388 0.763644 0.763900 0.764156 0.764413
c 0.769147 0.768448 0.767748 0.767049 0.766349 0.765650
d 0.785060 0.783476 0.781893 0.780310 0.778727 0.777144
e 0.795165 0.794740 0.794314 0.793888 0.793462 0.793036
f 0.778276 0.777705 0.777133 0.776562 0.775990 0.775419
g 2.043678 2.050583 2.057487 2.064392 2.071296 2.078201
2017-09-29 2017-09-30 2017-10-01
a 0.812898 0.812710 0.812522
b 0.764669 0.764925 0.765181
c 0.764950 0.764251 0.763551
d 0.775561 0.773978 0.772395
e 0.792611 0.792185 0.791759
f 0.774848 0.774276 0.773705
g 2.085105 2.092010 2.098914
[7 rows x 1735 columns]
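The question also asks for weekly values; one simple reading of that (an assumption, since the asker doesn't specify how weeks should be defined) is to sample the interpolated daily grid once per week, for example every Monday:
# keep one interpolated column per week from the daily grid
weekly = df.reindex(pd.date_range(df.columns[0], df.columns[-1], freq='W-MON'), axis=1)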
I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and x days after that date.
That means the resulting DataFrame will contain more rows (specifically, n * (1 + 2*x), where n is the original number of dates in col1).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
You can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
   .stack()
   .drop_duplicates()
   .reset_index(level=[0, 1], drop=True)
   .to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a dataframe with a datetime.date column and stacks shifted copies of that column underneath the original data, one timedelta back and one forward (x=1):
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = (pd.concat([df.date - datetime.timedelta(days=1), df.date,
                 df.date + datetime.timedelta(days=1)], ignore_index=True)
        .drop_duplicates().sort_values(ignore_index=True).to_frame())
I have a DataFrame like the following.
df = pd.DataFrame({'id': [1, 1, 2, 3, 2],
                   'value': ["a", "b", "a", "a", "c"],
                   'Time': ['6/Nov/2012 23:59:59 -0600', '6/Nov/2012 00:00:05 -0600',
                            '7/Nov/2012 00:00:09 -0600', '27/Nov/2012 00:00:13 -0600',
                            '27/Nov/2012 00:00:17 -0600']})
I need to get an output like the following:
combined_id | enter time | exit time | time difference
combined_id should be created by grouping 'id' and 'value':
g = df.groupby(['id', 'value'])
The following doesn't work when grouping by two columns. (How can I use first() and last() here as the enter and exit times?)
df['enter'] = g.apply(lambda x: x.first())
To get the difference, would the following work?
df['delta'] = (df['exit']-df['enter'].shift()).fillna(0)
First ensure your column is a proper datetime column:
In [11]: df['Time'] = pd.to_datetime(df['Time'])
Now, you can do the groupby and use agg with the 'first' and 'last' groupby methods, naming the results enter and exit:
In [12]: g = df.groupby(['id', 'value'])
In [13]: res = g['Time'].agg(enter='first', exit='last')
In [14]: res['time_diff'] = res['exit'] - res['enter']
In [15]: res
Out[15]:
enter exit time_diff
id value
1 a 2012-11-06 23:59:59 2012-11-06 23:59:59 0 days
b 2012-11-06 00:00:05 2012-11-06 00:00:05 0 days
2 a 2012-11-07 00:00:09 2012-11-07 00:00:09 0 days
c 2012-11-27 00:00:17 2012-11-27 00:00:17 0 days
3 a 2012-11-27 00:00:13 2012-11-27 00:00:13 0 days
Note: this is a bit of a boring example since there is only one item in each group...
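If a single combined_id column is wanted, as the layout sketched in the question suggests (the exact format is an assumption), the group keys can be flattened afterwards:
out = res.reset_index()
out['combined_id'] = out['id'].astype(str) + '_' + out['value']   # e.g. '1_a'
out = out[['combined_id', 'enter', 'exit', 'time_diff']]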