I want to write code that creates an additional dataframe from the dataframe data. The new dataframe data2 will have the following changes:
label will be New instead of Old
col1's last index will be deleted
col2's first index will be deleted
date's first index will be deleted, and 1 minute will be subtracted from all remaining date values
Then I want to concatenate the two dataframes into one dataframe called merge, sorted by date. Since the first index of data2 is dropped, merge should go in order of label: New, Old, New, Old. How can I subtract 1 minute from date_mod and merge the two dataframes in order of dates?
import pandas as pd
d = {'col1': [4, 5, 2, 2, 3, 5, 1, 1, 6], 'col2': [6, 2, 1, 7, 3, 5, 3, 3, 9],
     'label': ['Old', 'Old', 'Old', 'Old', 'Old', 'Old', 'Old', 'Old', 'Old'],
     'date': ['2022-01-24 10:07:02', '2022-01-27 01:55:03', '2022-01-30 19:09:03', '2022-02-02 14:34:06',
              '2022-02-08 12:37:03', '2022-02-10 03:07:02', '2022-02-10 14:02:03', '2022-02-11 00:32:25',
              '2022-02-12 21:42:03']}
data = pd.DataFrame(d)
'''
Additional Dataframe
'label' will have New
'col1's last index will be deleted
'col2's first index will be deleted
'date's first index will be deleted and 1 minute will be subtracted from all remaining date values
'''
a = data['col1'].drop(data['col1'].index[-1])  # col1 without its last value
b = data['col2'].drop(data['col2'].index[0])   # col2 without its first value
# subtract 1 minute from date_mod -- this is the part I can't figure out
date_mod = pd.to_datetime(data['date'][1:])
data2 = pd.DataFrame({'col1': a, 'col2': b,
                      'label': ['New', 'New', 'New', 'New', 'New', 'New', 'New', 'New'],
                      'date': date_mod})
'''
Merging data and data2
Sort by 'date'
Should go in order as Old, New, Old, New ...
The columns are one value shorter than in data because of the dropped indexes
'''
merge = pd.merge(data, data2)  # my attempt -- not giving the desired result
The simplest way I can think of: put all the adjustments into a function and apply it to a copy of the original dataframe, then simply concat and sort:
data.date = pd.to_datetime(data.date)  # convert the date strings to datetime so we can subtract 1 minute later

def adjust_data(df):
    df['col1'] = df['col1'].drop(df['col1'].index[-1])  # drop the last value of col1
    df['col2'] = df['col2'].drop(df['col2'].index[0])   # drop the first value of col2
    df.date = df.date - pd.Timedelta(minutes=1)         # subtract 1 minute from every date
    df.label = df.label.replace('Old', 'New')           # change the values in the "label" column

data2 = data.copy()
adjust_data(data2)  # apply the function to data2

# concat both dataframes and sort by the "date" column
merge = pd.concat([data, data2], axis=0).sort_values(by=['date']).reset_index(drop=True)
print(merge)
out:
col1 col2 label date
0 4.0 NaN New 2022-01-24 10:06:02
1 4.0 6.0 Old 2022-01-24 10:07:02
2 5.0 2.0 New 2022-01-27 01:54:03
3 5.0 2.0 Old 2022-01-27 01:55:03
4 2.0 1.0 New 2022-01-30 19:08:03
5 2.0 1.0 Old 2022-01-30 19:09:03
6 2.0 7.0 New 2022-02-02 14:33:06
7 2.0 7.0 Old 2022-02-02 14:34:06
8 3.0 3.0 New 2022-02-08 12:36:03
9 3.0 3.0 Old 2022-02-08 12:37:03
10 5.0 5.0 New 2022-02-10 03:06:02
11 5.0 5.0 Old 2022-02-10 03:07:02
12 1.0 3.0 New 2022-02-10 14:01:03
13 1.0 3.0 Old 2022-02-10 14:02:03
14 1.0 3.0 New 2022-02-11 00:31:25
15 1.0 3.0 Old 2022-02-11 00:32:25
16 NaN 9.0 New 2022-02-12 21:41:03
17 6.0 9.0 Old 2022-02-12 21:42:03
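Alternatively, if you want data2 built exactly as the question describes -- with the first date dropped rather than kept -- here is a minimal sketch (not the answer above), reusing the question's a, b, and date_mod pieces and using .to_numpy() to sidestep index alignment:
import pandas as pd

data = pd.DataFrame(d)                       # d as defined in the question
data['date'] = pd.to_datetime(data['date'])

a = data['col1'].drop(data['col1'].index[-1])          # 8 values: last dropped
b = data['col2'].drop(data['col2'].index[0])           # 8 values: first dropped
date_mod = data['date'][1:] - pd.Timedelta(minutes=1)  # 8 values: first dropped, minus 1 minute

# .to_numpy() discards the mismatched indexes so the three 8-element columns line up
data2 = pd.DataFrame({'col1': a.to_numpy(), 'col2': b.to_numpy(),
                      'label': 'New', 'date': date_mod.to_numpy()})

merge = pd.concat([data, data2]).sort_values('date').reset_index(drop=True)
print(merge)
Sorted this way, merge has 17 rows: it starts with the lone Old row from 2022-01-24 and then alternates New/Old, matching the "Old, New, Old, New ..." order in the question's comment.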
I am trying to loop through a dataframe, creating dynamic ranges that are limited to the last 6 months of every row index.
Because I am looking back 6 months, I start from the first index row that has a date >= the first date in row index 0 of the dataframe. The condition which I have managed to create is shown below:
for i in df.index:
    if datetime.strptime(df['date'][i], '%Y-%m-%d %H:%M:%S') >= (datetime.strptime(df['date'].iloc[0], '%Y-%m-%d %H:%M:%S') + dateutil.relativedelta.relativedelta(months=6)):
However, this merely creates ranges that grow in size, incorporating all data indexed after the first row whose date is >= the first date in row index 0 of the dataframe.
How can I limit the condition statement to only the last 6 months of each row index?
I'm not sure what exactly you want to do once you have your "dynamic ranges".
You can obtain a list of intervals (t - 6mo, t) for each t in your DatetimeIndex:
intervals = [(t - pd.DateOffset(months=6), t) for t in df.index]
But doing selection operations in a big for-loop might be slow.
Instead, you might be interested in pandas's rolling operations. Rolling can even use a date offset (as long as it is fixed-frequency) instead of a fixed-size int window width. However, "6 months" is a non-fixed frequency, so regular rolling won't accept it.
Still, if you are ok with an approximation, say "182 days", then the following might work well.
import numpy as np
import pandas as pd

# setup
n = 10
df = pd.DataFrame(
    {'a': np.arange(n), 'b': np.ones(n)},
    index=pd.date_range('2019-01-01', freq='M', periods=n))

# example: sum
df.rolling('182D', min_periods=0).sum()
# out:
a b
2019-01-31 0.0 1.0
2019-02-28 1.0 2.0
2019-03-31 3.0 3.0
2019-04-30 6.0 4.0
2019-05-31 10.0 5.0
2019-06-30 15.0 6.0
2019-07-31 21.0 7.0
2019-08-31 27.0 6.0
2019-09-30 33.0 6.0
2019-10-31 39.0 6.0
If you want to be strict about the 6-month windows, you can implement your own pandas.api.indexers.BaseIndexer and pass it as the window argument of rolling.
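A minimal sketch of such an indexer, assuming the index is a sorted DatetimeIndex (the class name SixMonthIndexer is mine, not a pandas name):
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class SixMonthIndexer(BaseIndexer):
    """Variable-width window covering (t - 6 calendar months, t] for each row t."""
    def __init__(self, index):
        super().__init__()
        self.index = index  # a sorted DatetimeIndex

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        # the window for row i ends just after row i ...
        end = np.arange(1, num_values + 1, dtype=np.int64)
        # ... and starts at the first position strictly after t_i - 6 months
        start = self.index.searchsorted(self.index - pd.DateOffset(months=6),
                                        side='right').astype(np.int64)
        return start, end

# usage, reusing the df from the example above
df.rolling(SixMonthIndexer(df.index), min_periods=0).sum()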
I have two dataframes which contain data collected at two different frequencies.
I want to update the label of df2, to that of df1 if it falls into the duration of an event.
I created a nested for-loop to do it, but it takes a rather long time.
Here is the code I used:
for i in np.arange(len(df1)-1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j, "label"] = df1.loc[i, "label"]
Is there a more efficient way of doing this?
df1 size (367, 4)
df2 size (342423, 9)
Short example data:
import numpy as np
import pandas as pd

data1 = {'timestamp': [1, 2, 3, 4, 5, 6, 7, 8, 9],
         'duration': [0.5, 0.3, 0.8, 0.2, 0.4, 0.5, 0.3, 0.7, 0.5],
         'label': ['inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh']}
df1 = pd.DataFrame(data1, columns=['timestamp', 'duration', 'label'])

data2 = {'timestamp': [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5],
         'label': ['plc'] * 18}
df2 = pd.DataFrame(data2, columns=['timestamp', 'label'])
I would first use merge_asof to select, for each timestamp in df2, the closest timestamp in df1 at or below it. Then a simple (vectorized) comparison of df2.timestamp against df1.timestamp + df1.duration is enough to select the matching lines.
Code could be:
df1['t2'] = df1['timestamp'].astype('float64')  # dtypes of the join columns must match
# for each df2 row, pick the closest df1 event starting at or before it
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
# keep the label only where the df2 timestamp falls within the event's duration
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y
It gives for df2:
timestamp label
0 1.0 inh
1 1.5 inh
2 2.0 exh
3 2.5 plc
4 3.0 inh
5 3.5 inh
6 4.0 exh
7 4.5 plc
8 5.0 inh
9 5.5 plc
10 6.0 exh
11 6.5 exh
12 7.0 inh
13 7.5 plc
14 8.0 exh
15 8.5 exh
16 9.0 inh
17 9.5 inh
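As an alternative sketch (not from the original answer), the "falls inside an event" test can be made explicit with an IntervalIndex, assuming the events in df1 do not overlap (closed='both' matches the output above):
# one interval per df1 event: [timestamp, timestamp + duration]
events = pd.IntervalIndex.from_arrays(df1['timestamp'],
                                      df1['timestamp'] + df1['duration'],
                                      closed='both')
pos = events.get_indexer(df2['timestamp'])   # -1 where a timestamp hits no event
hit = pos >= 0
df2.loc[hit, 'label'] = df1['label'].to_numpy()[pos[hit]]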
I have a pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
Each missing date takes the tag of the previous date. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business-month-end index with method='ffill', then reset the index and rename it back to the original:
# First ensure the dates are pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])

# Create a business-month-end index covering the full period.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')

# Set ref_date as the index, reindex to the monthly index with forward fill,
# then restore ref_date as a column.
(df
 .set_index('ref_date')
 .reindex(idx_monthly, method='ffill')
 .reset_index()
 .rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
Thanks to a previous answerer, who has since deleted their answer, I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
I have two questions:
1) Is there something like pandas groupby but applicable on columns (df.columns, not the data within)?
2) How can I extract the "date" from a datetime object?
I have lots of pandas dataframes (or csv files) that have a position column (which I use as the index) and then columns of values measured at each position at different times. The column header is a datetime object (or pd.to_datetime).
I would like to extract data from the same date and save them into a new file.
Here is a simple example of two such dataframes.
df1:
2015-03-13 14:37:00 2015-03-13 14:38:00 2015-03-13 14:38:15 \
0.0 24.49393 24.56345 24.50552
0.5 24.45346 24.54904 24.60773
1.0 24.46216 24.55267 24.74365
1.5 24.55414 24.63812 24.80463
2.0 24.68079 24.76758 24.78552
2.5 24.79236 24.83005 24.72879
3.0 24.83691 24.78308 24.66727
3.5 24.78452 24.73071 24.65085
4.0 24.65857 24.79398 24.72290
4.5 24.56390 24.93515 24.83267
5.0 24.62161 24.96939 24.87366
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
and df2:
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30 \
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
2015-05-23 12:25:00 2015-05-23 12:26:00 2015-05-23 12:26:30
0.0 10.31052 10.132660 10.176910
0.5 10.26834 10.086910 10.252720
1.0 10.27393 10.165890 10.276670
1.5 10.29330 10.219090 10.335910
2.0 10.24432 10.193940 10.406430
2.5 10.11618 10.157470 10.323120
3.0 10.02454 10.110720 10.115360
3.5 10.08716 10.010680 9.997345
4.0 10.23868 9.905670 10.008090
4.5 10.27216 9.879425 9.979645
5.0 10.10693 9.919800 9.870361
df1 has data from 13 March and 19 May, df2 has data from 19 May and 23 May. From these two dataframes containing data from 3 days, I would like to get 3 dataframes (or csv files or any other object), one for each day.
(And for a real-life example, multiply the number of lines, columns and files by some hundred.)
In the worst case I can specify the dates in a separate list, but I am still failing to extract these dates from the dataframes.
I did have an idea of a nested loop:
for df in dataframes:
    for d in dates:
        new_df = df[d]
but I can't get the date from the datetime.
First concatenate all the DataFrames along the columns, then group the columns by date (formatted with strftime) and convert the groupby object into a dictionary of DataFrames keyed by date string:
df = pd.concat([df1, df2, dfN], axis=1)   # df1, df2, ..., dfN: all your dataframes
dfs = dict(tuple(df.groupby(df.columns.strftime('%Y-%m-%d'), axis=1)))

# select a DataFrame by date
print(dfs['2015-03-13'])
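For the second part of the question (getting the date out of a datetime), a short illustration of the relevant pandas calls:
ts = pd.Timestamp('2015-03-13 14:37:00')
ts.date()                   # datetime.date(2015, 3, 13)

cols = pd.to_datetime(['2015-03-13 14:37:00', '2015-05-19 11:33:00'])
cols.strftime('%Y-%m-%d')   # Index(['2015-03-13', '2015-05-19'], dtype='object')
cols.normalize()            # DatetimeIndex of the same dates at midnight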
I have two dataframes: one has a multi-level column index, and the other has single-level columns (the first level of the first dataframe; that is, the second dataframe is calculated by grouping the first).
These two dataframes look like the following:
(df1 and df2 are shown as images in the original post.)
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1 = pd.rolling_apply(df1, window=5, func=lambda x: pd.Series(x).idxmax(), min_periods=4)
Let me explain result1 a little. For example, during the five days (the window length) 2016/2/23 - 2016/2/29, the max price of stock sh600870 occurred on 2016/2/24, and the index of 2016/2/24 within the five-day range is 1. So, in result1, the value for stock sh600870 on 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Take the same stock as an example: sh600870 is in the sector '家用电器视听器材白色家电'. So on 2016/2/29, I want to get the sector price from 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternatively, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore may be more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note: in pandas 0.18 the rolling_apply syntax changed. DataFrames and Series now have a rolling method, so you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
         .rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
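A side note (not in the original answer): newer pandas versions let you pass raw=True to rolling.apply, so the function receives a plain NumPy array instead of a Series, which is typically faster for NumPy reducers like np.argmax:
# raw=True: func gets an ndarray per window rather than a Series
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax, raw=True)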