selecting rows in a pandas dataframe starting with a certain index value - python

Suppose I have a dataframe, where the rows are indexed by trading days, so something like:
Date ClosingPrice
2017-3-16 10.00
2017-3-17 10.13
2017-3-20 10.19
...
I want to find $N$ rows starting with (say) 2017-2-28. I don't know the date range in advance; I just know that I want to grab, say, the ten rows starting there. What is the most elegant way of doing this? (There are plenty of ugly ways...)

my quick answer
s = df.Date.searchsorted(pd.to_datetime('2017-2-28'))[0]
df.iloc[s:s + 10]
demo
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    Date=pd.date_range('2017-01-31', periods=90, freq='B'),
    ClosingPrice=np.random.rand(90)
)).iloc[:, ::-1]
date = pd.to_datetime('2017-3-11')
s = df.Date.searchsorted(date)[0]
df.iloc[s:s + 10]
Date ClosingPrice
29 2017-03-13 0.737527
30 2017-03-14 0.411525
31 2017-03-15 0.794309
32 2017-03-16 0.578911
33 2017-03-17 0.747763
34 2017-03-20 0.081113
35 2017-03-21 0.000058
36 2017-03-22 0.274022
37 2017-03-23 0.367831
38 2017-03-24 0.100930
naive time test

df[df['Date'] >= pd.Timestamp(2017, 2, 28)][:10]
I guess?
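Note: in recent pandas versions, Series.searchsorted called with a scalar returns a plain integer position rather than a one-element array, so the [0] indexing can be dropped. A minimal sketch, assuming df is sorted by Date as above:
s = df.Date.searchsorted(pd.to_datetime('2017-2-28'))  # integer position in newer pandas
df.iloc[s:s + 10]  # the ten rows starting at that date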

Related

time stamp - how to calculate time difference in seconds with a groupby

I have a pandas dataframe with id and date as the 2 columns - the date column has timestamps down to the second.
data = {'id': [17, 17, 17, 17, 17, 18, 18, 18, 18],
        'date': ['2018-01-16', '2018-01-26', '2018-01-27', '2018-02-11',
                 '2018-03-14', '2018-01-28', '2018-02-12', '2018-02-25', '2018-03-04'],
        }
df1 = pd.DataFrame(data)
I would like to have a new column - (tslt) - 'time_since_last_transaction'. The first transaction for each unique id could be assigned a fixed number, say 1. Each subsequent transaction for that user should measure the difference between the first timestamp for that user and the current timestamp, giving a time difference in seconds.
I tried datetime and timedelta etc. but did not have much luck. Any help would be appreciated.
You can try groupby().transform():
df1['date'] = pd.to_datetime(df1['date'])
df1['diff'] = df1['date'].sub(df1.groupby('id').date.transform('min')).dt.total_seconds()
Output:
id date diff
0 17 2018-01-16 0.0
1 17 2018-01-26 864000.0
2 17 2018-01-27 950400.0
3 17 2018-02-11 2246400.0
4 17 2018-03-14 4924800.0
5 18 2018-01-28 0.0
6 18 2018-02-12 1296000.0
7 18 2018-02-25 2419200.0
8 18 2018-03-04 3024000.0
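If instead the difference to the previous transaction (rather than the first one) is what is wanted, a minimal sketch along the same lines, assuming df1 from above:
df1['date'] = pd.to_datetime(df1['date'])
# per-id difference to the previous row; NaN for each id's first transaction
df1['tslt'] = df1.groupby('id')['date'].diff().dt.total_seconds()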

Calculation in grouped dataframe with date type index

I have a dataset like:
date_time value
30.04.20 9:31 1
30.04.20 10:12 5
30.04.20 15:16 2
01.05.20 12:01 63
01.05.20 13:00 78
02.05.20 7:23 4
02.05.20 17:34 2
02.05.20 18:34 4
02.05.20 21:39 3458
03.05.20 9:34 77
03.05.20 14:54 4
03.05.20 16:54 7
04.05.20 15:24 35
I need to group records within a day and calculate the total over a 3-day (previous day, current day, next day) window, as follows (desired result):
date value
01.05.2020 3617
02.05.2020 3697
03.05.2020 3591
I wrote the beginning of the code
import pandas as pd
df = pd.read_excel(...)
df['date'] = df['date_time'].dt.normalize()
df.groupby('date').sum()
The grouped dataframe here looks like:
date value
30.04.2020 8
01.05.2020 141
02.05.2020 3468
03.05.2020 88
04.05.2020 35
But I can't go further because I don't understand how to get the desired result in a concise "pandas" way. Please give me some pointers.
You have almost done the work; just add these lines of code to your current solution:
df_group = df.groupby('date').sum()
results = df_group.rolling(window=3, min_periods=3, center=True).sum()
print(results)
2020-04-30 NaN
2020-05-01 3617.0
2020-05-02 3697.0
2020-05-03 3591.0
2020-05-04 NaN
# retain only rows with values
print(results.dropna())
date
2020-05-01 3617.0
2020-05-02 3697.0
2020-05-03 3591.0
Hope this helps!
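For completeness, a minimal end-to-end sketch, assuming the date_time column arrives as day.month.year strings as in the sample (the few rows here are only illustrative):
import pandas as pd

raw = pd.DataFrame({'date_time': ['30.04.20 9:31', '01.05.20 12:01', '02.05.20 7:23'],
                    'value': [1, 63, 4]})
raw['date_time'] = pd.to_datetime(raw['date_time'], dayfirst=True)  # parse day-first timestamps
daily = raw.set_index('date_time')['value'].resample('D').sum()     # group records within a day
results = daily.rolling(window=3, min_periods=3, center=True).sum().dropna()
print(results)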

Vectorized count of daily longest consecutive streak

For evaluating daily longest consecutive runtimes of a power plant, I have to evaluate the longest streak per day, meaning that each day is considered as a separate timeframe.
So let's say I've got the power output in the dataframe df:
import numpy as np
import pandas as pd

df = pd.Series(
    data=[
        *np.zeros(4), *(np.full(24*5, 19.5) + np.random.rand(24*5)),
        *np.zeros(4), *(np.full(8, 19.5) + np.random.rand(8)),
        *np.zeros(5), *(np.full(24, 19.5) + np.random.rand(24)),
        *np.zeros(27), *(np.full(24, 19.5) + np.random.rand(24))],
    index=pd.date_range(start='2019-07-01 00:00:00', periods=9*24, freq='1h'))
And the "cutoff-power" is 1 (everything below that is considered as off). I use this to mask the "on"-values, shift and compare the mask to itself to count the number of consecutive groups. Finally I group the groups by the days of the year in the index and count the daily consecutive values consec_group:
mask = df > 1
groups = mask.ne(mask.shift()).cumsum()
consec_group = groups[mask].groupby(groups[mask].index.date).value_counts()
Which yields:
consec_group
Out[3]:
2019-07-01 2 20
2019-07-02 2 24
2019-07-03 2 24
2019-07-04 2 24
2019-07-05 2 24
2019-07-06 4 8
2 4
6 3
2019-07-07 6 21
2019-07-09 8 24
dtype: int64
But I'd like to have the maximum value of each consecutive daily streak and dates without any runtime should be displayed with zeros, as in 2019-07-08 7 0. See the expected result:
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
dtype: int64
Any help will be appreciated!
First remove the second index level with Series.reset_index, filter out the duplicated (non-first) values per day with a callable, and fill in the missing days with Series.asfreq - this works because .value_counts sorts the Series in descending order, so the first value per day is the maximum:
consec_group = (consec_group.reset_index(level=1, drop=True)
                            [lambda x: ~x.index.duplicated()]
                            .asfreq('d', fill_value=0))
print(consec_group)
Or solution with GroupBy.first:
consec_group = (consec_group.groupby(level=0)
                            .first()
                            .asfreq('d', fill_value=0))
print(consec_group)
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
Freq: D, dtype: int64
Ok, I guess I was too close to the finish line to see the answer... Looks like I had already solved the complex part.
So right after posting the question, I tested max with the level=0 argument instead of level=1 and that was the solution:
max_consec_group = consec_group.max(level=0).asfreq('d', fill_value=0)
Thanks at jezrael for the asfreq part!
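Note for newer pandas: the level argument of Series.max has since been deprecated and removed, so the equivalent groupby form is the safer spelling (a sketch using the same consec_group as above):
# groupby(level=0).max() replaces the removed max(level=0) in recent pandas
max_consec_group = consec_group.groupby(level=0).max().asfreq('d', fill_value=0)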

filtering date column in python

I'm new to python and I'm facing the following problem. I have a dataframe composed of 2 columns, one of them is date (datetime64[ns]). I want to keep all records within the last 12 months. My code is the following:
today=start_time.date()
last_year = today + relativedelta(months = -12)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
when I run it I get the following message:
TypeError: type object 2017-06-05
Any ideas?
last_year seems to bring me the date that I want in the following format: 2017-06-05
Create a Timedelta object in pandas to increment the date by 12 months. Call pandas.Timestamp('now') to get the current date, and then create a date_range. Here is an example for getting monthly data for 12 months.
import pandas as pd
import datetime
list_1 = [i for i in range(0, 12)]
list_2 = [i for i in range(13, 25)]
list_3 = [i for i in range(26, 38)]
data_frame = pd.DataFrame({'A': list_1, 'B': list_2, 'C': list_3},
                          index=pd.date_range(pd.Timestamp('now'),
                                              pd.Timestamp('now') + pd.Timedelta(weeks=53),
                                              freq='M'))
We create a timestamp for the current date and enter that as our start date. Then we create a timedelta to increment that date by 53 weeks (or 52 if you'd like) which gets us 12 months of data. Below is the output:
A B C
2018-06-30 05:05:21.335625 0 13 26
2018-07-31 05:05:21.335625 1 14 27
2018-08-31 05:05:21.335625 2 15 28
2018-09-30 05:05:21.335625 3 16 29
2018-10-31 05:05:21.335625 4 17 30
2018-11-30 05:05:21.335625 5 18 31
2018-12-31 05:05:21.335625 6 19 32
2019-01-31 05:05:21.335625 7 20 33
2019-02-28 05:05:21.335625 8 21 34
2019-03-31 05:05:21.335625 9 22 35
2019-04-30 05:05:21.335625 10 23 36
2019-05-31 05:05:21.335625 11 24 37
Try
today = datetime.datetime.now()
You can use pandas functionality with datetime objects. The syntax is often more intuitive and obviates the need for additional imports.
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
As such, we would need to see all your code to be sure of the reason behind your error; for example, how is start_time defined?
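Putting that together, a minimal sketch with hypothetical sample data (the mydate column name is taken from the question; the dates are made up for illustration):
import pandas as pd

df = pd.DataFrame({'mydate': ['2017-06-05', '2023-11-20', '2024-02-01'],
                   'value': [1, 2, 3]})
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]  # keep rows from the last 12 months
print(new_df)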

Getting the average of a certain hour on weekdays over several years in a pandas dataframe

I have an hourly dataframe in the following format over several years:
Date/Time Value
01.03.2010 00:00:00 60
01.03.2010 01:00:00 50
01.03.2010 02:00:00 52
01.03.2010 03:00:00 49
.
.
.
31.12.2013 23:00:00 77
I would like to average the data so I can get the average of hour 0, hour 1... hour 23 of each of the years.
So the output should look somehow like this:
Year Hour Avg
2010 00 63
2010 01 55
2010 02 50
.
.
.
2013 22 71
2013 23 80
Does anyone know how to obtain this in pandas?
Note: now that Series has the dt accessor, it's less important that the date be the index, though Date/Time still needs to be a datetime64.
Update: You can do the groupby more directly (without the lambda):
In [21]: df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
Out[21]:
Value
Date/Time Date/Time
2010 0 60
1 50
2 52
3 49
In [22]: res = df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
In [23]: res.index.names = ["year", "hour"]
In [24]: res
Out[24]:
Value
year hour
2010 0 60
1 50
2 52
3 49
If it's a datetime64 index you can do:
In [31]: df1.groupby([df1.index.year, df1.index.hour]).mean()
Out[31]:
Value
2010 0 60
1 50
2 52
3 49
Old answer (will be slower):
Assuming Date/Time was the index* you can use a mapping function in the groupby:
In [11]: year_hour_means = df1.groupby(lambda x: (x.year, x.hour)).mean()
In [12]: year_hour_means
Out[12]:
Value
(2010, 0) 60
(2010, 1) 50
(2010, 2) 52
(2010, 3) 49
For a more useful index, you could then create a MultiIndex from the tuples:
In [13]: year_hour_means.index = pd.MultiIndex.from_tuples(year_hour_means.index,
names=['year', 'hour'])
In [14]: year_hour_means
Out[14]:
Value
year hour
2010 0 60
1 50
2 52
3 49
* if not, then first use set_index:
df1 = df.set_index('Date/Time')
If your date/time column were in the datetime format (see dateutil.parser for automatic parsing options) and set as the index, you can use pandas resample as below:
year_hour_means = df.resample('H').mean()
which will keep your data in the datetime format. This may help you with whatever you are going to be doing with your data down the line.
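For a self-contained runnable version of the groupby approach above that produces the Year/Hour/Avg layout from the question, a sketch assuming Date/Time arrives as day.month.year strings (the few rows here are only illustrative):
import pandas as pd

df = pd.DataFrame({
    'Date/Time': ['01.03.2010 00:00:00', '01.03.2010 01:00:00', '31.12.2013 23:00:00'],
    'Value': [60, 50, 77],
})
dt = pd.to_datetime(df['Date/Time'], dayfirst=True)  # parse day-first timestamps
res = (df.groupby([dt.dt.year.rename('Year'), dt.dt.hour.rename('Hour')])['Value']
         .mean()
         .rename('Avg')
         .reset_index())
print(res)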
