selecting rows in a pandas dataframe starting with a certain index value - python

Suppose I have a dataframe, where the rows are indexed by trading days, so something like:
Date ClosingPrice
2017-3-16 10.00
2017-3-17 10.13
2017-3-20 10.19
...
I want to find $N$ rows starting with (say) 2017-2-28. I don't know the date range in advance; I just know that I want to grab, say, the ten rows starting there. What is the most elegant way of doing this? (There are plenty of ugly ways...)

my quick answer
s = df.Date.searchsorted(pd.to_datetime('2017-2-28'))[0]
df.iloc[s:s + 10]
demo
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    Date=pd.date_range('2017-01-31', periods=90, freq='B'),
    ClosingPrice=np.random.rand(90)
)).iloc[:, ::-1]
date = pd.to_datetime('2017-3-11')
s = df.Date.searchsorted(date)[0]
df.iloc[s:s + 10]
Date ClosingPrice
29 2017-03-13 0.737527
30 2017-03-14 0.411525
31 2017-03-15 0.794309
32 2017-03-16 0.578911
33 2017-03-17 0.747763
34 2017-03-20 0.081113
35 2017-03-21 0.000058
36 2017-03-22 0.274022
37 2017-03-23 0.367831
38 2017-03-24 0.100930
naive time test

df[df['Date'] >= pd.Timestamp(2017, 2, 28)][:10]
I guess?
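Note: in recent pandas versions, Series.searchsorted called with a scalar returns a plain integer position rather than a one-element array, so the [0] indexing can be dropped. A minimal sketch, assuming df is sorted by Date as above:
s = df.Date.searchsorted(pd.to_datetime('2017-2-28'))  # integer position in newer pandas
df.iloc[s:s + 10]  # the ten rows starting at that date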

Related

time stamp - how to calculate time difference in seconds with a groupby

I have a pandas dataframe with id and date as the 2 columns - the date column has timestamps down to the second.
data = {'id': [17, 17, 17, 17, 17, 18, 18, 18, 18],
        'date': ['2018-01-16', '2018-01-26', '2018-01-27', '2018-02-11',
                 '2018-03-14', '2018-01-28', '2018-02-12', '2018-02-25', '2018-03-04'],
        }
df1 = pd.DataFrame(data)
I would like to have a new column - (tslt) - 'time_since_last_transaction'. The first transaction for each unique id could be assigned a fixed number, say 1. Each subsequent transaction for that user should measure the difference between the first timestamp for that user and the current timestamp, giving a time difference in seconds.
I tried datetime and timedelta etc. but did not have much luck. Any help would be appreciated.
You can try groupby().transform():
df1['date'] = pd.to_datetime(df1['date'])
df1['diff'] = df1['date'].sub(df1.groupby('id').date.transform('min')).dt.total_seconds()
Output:
id date diff
0 17 2018-01-16 0.0
1 17 2018-01-26 864000.0
2 17 2018-01-27 950400.0
3 17 2018-02-11 2246400.0
4 17 2018-03-14 4924800.0
5 18 2018-01-28 0.0
6 18 2018-02-12 1296000.0
7 18 2018-02-25 2419200.0
8 18 2018-03-04 3024000.0
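If instead the difference to the previous transaction (rather than the first one) is what is wanted, a minimal sketch along the same lines, assuming df1 from above:
df1['date'] = pd.to_datetime(df1['date'])
# per-id difference to the previous row; NaN for each id's first transaction
df1['tslt'] = df1.groupby('id')['date'].diff().dt.total_seconds()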

Calculation in grouped dataframe with date type index

I have a dataset like:
date_time value
30.04.20 9:31 1
30.04.20 10:12 5
30.04.20 15:16 2
01.05.20 12:01 63
01.05.20 13:00 78
02.05.20 7:23 4
02.05.20 17:34 2
02.05.20 18:34 4
02.05.20 21:39 3458
03.05.20 9:34 77
03.05.20 14:54 4
03.05.20 16:54 7
04.05.20 15:24 35
I need to group records within a day and calculate the total over a 3-day (previous day, current day, next day) window, as follows (desired result):
date value
01.05.2020 3617
02.05.2020 3697
03.05.2020 3591
I wrote the beginning of the code
import pandas as pd
df = pd.read_excel(...)
df['date'] = df['date_time'].dt.normalize()
df.groupby('date').sum()
The grouped dataframe here looks like:
date value
30.04.2020 8
01.05.2020 141
02.05.2020 3468
03.05.2020 88
04.05.2020 35
But I can't go further because I don't understand how to get the desired result in a concise "pandas" way. Please give me some pointers.
You have almost done the work; just add these lines of code to your current solution:
df_group = df.groupby('date').sum()
results = df_group.rolling(window=3, min_periods=3, center=True).sum()
print(results)
2020-04-30 NaN
2020-05-01 3617.0
2020-05-02 3697.0
2020-05-03 3591.0
2020-05-04 NaN
# retain only rows with values
print(results.dropna())
date
2020-05-01 3617.0
2020-05-02 3697.0
2020-05-03 3591.0
Hope this helps!
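For completeness, a minimal end-to-end sketch, assuming the date_time column arrives as day.month.year strings as in the sample (the few rows here are only illustrative):
import pandas as pd

raw = pd.DataFrame({'date_time': ['30.04.20 9:31', '01.05.20 12:01', '02.05.20 7:23'],
                    'value': [1, 63, 4]})
raw['date_time'] = pd.to_datetime(raw['date_time'], dayfirst=True)  # parse day-first timestamps
daily = raw.set_index('date_time')['value'].resample('D').sum()     # group records within a day
results = daily.rolling(window=3, min_periods=3, center=True).sum().dropna()
print(results)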

Vectorized count of daily longest consecutive streak

For evaluating daily longest consecutive runtimes of a power plant, I have to evaluate the longest streak per day, meaning that each day is considered as a separate timeframe.
So let's say I've got the power output in the dataframe df:
import numpy as np
import pandas as pd

df = pd.Series(
    data=[
        *np.zeros(4), *(np.full(24*5, 19.5) + np.random.rand(24*5)),
        *np.zeros(4), *(np.full(8, 19.5) + np.random.rand(8)),
        *np.zeros(5), *(np.full(24, 19.5) + np.random.rand(24)),
        *np.zeros(27), *(np.full(24, 19.5) + np.random.rand(24))],
    index=pd.date_range(start='2019-07-01 00:00:00', periods=9*24, freq='1h'))
And the "cutoff-power" is 1 (everything below that is considered as off). I use this to mask the "on"-values, shift and compare the mask to itself to count the number of consecutive groups. Finally I group the groups by the days of the year in the index and count the daily consecutive values consec_group:
mask = df > 1
groups = mask.ne(mask.shift()).cumsum()
consec_group = groups[mask].groupby(groups[mask].index.date).value_counts()
Which yields:
consec_group
Out[3]:
2019-07-01 2 20
2019-07-02 2 24
2019-07-03 2 24
2019-07-04 2 24
2019-07-05 2 24
2019-07-06 4 8
2 4
6 3
2019-07-07 6 21
2019-07-09 8 24
dtype: int64
But I'd like to have the maximum value of each consecutive daily streak and dates without any runtime should be displayed with zeros, as in 2019-07-08 7 0. See the expected result:
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
dtype: int64
Any help will be appreciated!
First remove the second index level with Series.reset_index, filter out the duplicated (non-first) values per day with a callable, and fill in the missing days with Series.asfreq - this works because .value_counts sorts the Series in descending order, so the first value per day is the maximum:
consec_group = (consec_group.reset_index(level=1, drop=True)
                            [lambda x: ~x.index.duplicated()]
                            .asfreq('d', fill_value=0))
print(consec_group)
Or solution with GroupBy.first:
consec_group = (consec_group.groupby(level=0)
                            .first()
                            .asfreq('d', fill_value=0))
print(consec_group)
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
Freq: D, dtype: int64
Ok, I guess I was too close to the finish line to see the answer... Looks like I had already solved the complex part.
So right after posting the question, I tested max with the level=0 argument instead of level=1 and that was the solution:
max_consec_group = consec_group.max(level=0).asfreq('d', fill_value=0)
Thanks at jezrael for the asfreq part!
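Note for newer pandas: the level argument of Series.max has since been deprecated and removed, so the equivalent groupby form is the safer spelling (a sketch using the same consec_group as above):
# groupby(level=0).max() replaces the removed max(level=0) in recent pandas
max_consec_group = consec_group.groupby(level=0).max().asfreq('d', fill_value=0)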

filtering date column in python

I'm new to python and I'm facing the following problem. I have a dataframe composed of 2 columns, one of them is date (datetime64[ns]). I want to keep all records within the last 12 months. My code is the following:
today=start_time.date()
last_year = today + relativedelta(months = -12)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
when I run it I get the following message:
TypeError: type object 2017-06-05
Any ideas?
last_year seems to bring me the date that I want in the following format: 2017-06-05
Create a Timedelta object in pandas to increment the date by 12 months. Call pandas.Timestamp('now') to get the current date, and then create a date_range. Here is an example for getting monthly data for 12 months.
import pandas as pd
import datetime
list_1 = [i for i in range(0, 12)]
list_2 = [i for i in range(13, 25)]
list_3 = [i for i in range(26, 38)]
data_frame = pd.DataFrame({'A': list_1, 'B': list_2, 'C': list_3},
                          index=pd.date_range(pd.Timestamp('now'),
                                              pd.Timestamp('now') + pd.Timedelta(weeks=53),
                                              freq='M'))
We create a timestamp for the current date and enter that as our start date. Then we create a timedelta to increment that date by 53 weeks (or 52 if you'd like) which gets us 12 months of data. Below is the output:
A B C
2018-06-30 05:05:21.335625 0 13 26
2018-07-31 05:05:21.335625 1 14 27
2018-08-31 05:05:21.335625 2 15 28
2018-09-30 05:05:21.335625 3 16 29
2018-10-31 05:05:21.335625 4 17 30
2018-11-30 05:05:21.335625 5 18 31
2018-12-31 05:05:21.335625 6 19 32
2019-01-31 05:05:21.335625 7 20 33
2019-02-28 05:05:21.335625 8 21 34
2019-03-31 05:05:21.335625 9 22 35
2019-04-30 05:05:21.335625 10 23 36
2019-05-31 05:05:21.335625 11 24 37
Try
today = datetime.datetime.now()
You can use pandas functionality with datetime objects. The syntax is often more intuitive and obviates the need for additional imports.
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
As such, we would need to see all your code to be sure of the reason behind your error; for example, how is start_time defined?
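Putting that together, a minimal sketch with hypothetical sample data (the mydate column name is taken from the question; the dates are made up for illustration):
import pandas as pd

df = pd.DataFrame({'mydate': ['2017-06-05', '2023-11-20', '2024-02-01'],
                   'value': [1, 2, 3]})
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]  # keep rows from the last 12 months
print(new_df)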

Getting the average of a certain hour on weekdays over several years in a pandas dataframe

I have an hourly dataframe in the following format over several years:
Date/Time Value
01.03.2010 00:00:00 60
01.03.2010 01:00:00 50
01.03.2010 02:00:00 52
01.03.2010 03:00:00 49
.
.
.
31.12.2013 23:00:00 77
I would like to average the data so I can get the average of hour 0, hour 1... hour 23 of each of the years.
So the output should look somehow like this:
Year Hour Avg
2010 00 63
2010 01 55
2010 02 50
.
.
.
2013 22 71
2013 23 80
Does anyone know how to obtain this in pandas?
Note: now that Series has the dt accessor, it's less important that the date be the index, though Date/Time still needs to be a datetime64.
Update: You can do the groupby more directly (without the lambda):
In [21]: df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
Out[21]:
Value
Date/Time Date/Time
2010 0 60
1 50
2 52
3 49
In [22]: res = df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
In [23]: res.index.names = ["year", "hour"]
In [24]: res
Out[24]:
Value
year hour
2010 0 60
1 50
2 52
3 49
If it's a datetime64 index you can do:
In [31]: df1.groupby([df1.index.year, df1.index.hour]).mean()
Out[31]:
Value
2010 0 60
1 50
2 52
3 49
Old answer (will be slower):
Assuming Date/Time was the index* you can use a mapping function in the groupby:
In [11]: year_hour_means = df1.groupby(lambda x: (x.year, x.hour)).mean()
In [12]: year_hour_means
Out[12]:
Value
(2010, 0) 60
(2010, 1) 50
(2010, 2) 52
(2010, 3) 49
For a more useful index, you could then create a MultiIndex from the tuples:
In [13]: year_hour_means.index = pd.MultiIndex.from_tuples(year_hour_means.index,
names=['year', 'hour'])
In [14]: year_hour_means
Out[14]:
Value
year hour
2010 0 60
1 50
2 52
3 49
* if not, then first use set_index:
df1 = df.set_index('Date/Time')
If your date/time column were in the datetime format (see dateutil.parser for automatic parsing options) and set as the index, you can use pandas resample as below:
year_hour_means = df.resample('H').mean()
which will keep your data in the datetime format. This may help you with whatever you are going to be doing with your data down the line.
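For a self-contained runnable version of the groupby approach above that produces the Year/Hour/Avg layout from the question, a sketch assuming Date/Time arrives as day.month.year strings (the few rows here are only illustrative):
import pandas as pd

df = pd.DataFrame({
    'Date/Time': ['01.03.2010 00:00:00', '01.03.2010 01:00:00', '31.12.2013 23:00:00'],
    'Value': [60, 50, 77],
})
dt = pd.to_datetime(df['Date/Time'], dayfirst=True)  # parse day-first timestamps
res = (df.groupby([dt.dt.year.rename('Year'), dt.dt.hour.rename('Hour')])['Value']
         .mean()
         .rename('Avg')
         .reset_index())
print(res)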
