Resample pandas dataframe by two columns - python

I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
                  time   id
0  2019-10-31 23:59:36   22
1  2019-10-31 23:58:23  260
2  2019-10-31 23:54:55   82
3  2019-10-31 23:54:46   82
4  2019-10-31 23:54:42   21
I would like to resample this into five-minute blocks, showing the number of arrivals at each station in the block that starts at the given time, so it should look like this:
                  time  id  arrivals
0  2019-10-31 23:55:00  22         1
1  2019-10-31 23:50:00  22         5
2  2019-10-31 23:55:00  82         0
3  2019-10-31 23:25:00  82       325
4  2019-10-31 23:21:00  21         1
How could I use some high-performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.

df.groupby(['id', pd.Grouper(key='time', freq='5min')])\
  .size()\
  .to_frame('arrivals')\
  .reset_index()
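For reference, a minimal sketch of how this runs against the example data from the question (building df here is my own assumption):

import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2019-10-31 23:59:36', '2019-10-31 23:58:23',
                            '2019-10-31 23:54:55', '2019-10-31 23:54:46',
                            '2019-10-31 23:54:42']),
    'id': [22, 260, 82, 82, 21],
})

arrivals = (df.groupby(['id', pd.Grouper(key='time', freq='5min')])
              .size()                    # rows per (station, 5-minute block)
              .to_frame('arrivals')
              .reset_index())
print(arrivals)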

I think it's an ugly solution (I couldn't find a better one at the moment), but it more or less gets you where you want to be:
df.groupby("id").resample("5min", on="time").count()[["id"]].swaplevel(0, 1, axis=0).sort_index(axis=0).set_axis(["arrivals"], axis=1)

Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
                          id
id  time
21  2019-10-31 23:50:00    1
22  2019-10-31 23:55:00    1
82  2019-10-31 23:50:00    2
260 2019-10-31 23:55:00    1
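Note that a plain groupby won't emit rows for five-minute blocks with zero arrivals (the desired output above includes an arrivals value of 0). A sketch of one way to fill those in, assuming every observed block should appear for every station:

counts = (df.groupby(['id', pd.Grouper(key='time', freq='5min')])
            .size()
            .unstack(fill_value=0)   # stations x blocks, zeros where no arrivals
            .stack()                 # back to long form, zero blocks included
            .rename('arrivals')
            .reset_index())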

Related

Dataframe - how to insert new row with null value, conditionally based on elapsed time?

Background: My dataset acquires values at roughly 5-minute intervals, but sometimes there are gaps. I am charting my dataset using Plotly and attempting to resolve an issue where a straight line is drawn between points if there is a gap in the dataset. Plotly has a parameter, connectgaps, which, if set to false, will not connect over 'nan' values. However, my dataset looks like this:
(where I have computed the time difference using df['time_diff_mins'] = (df['datetime'].shift(-1) - df['datetime']).dt.total_seconds() / 60)
             datetime  value  time_diff_mins
0 2022-03-09 09:25:00     98               5
1 2022-03-09 09:30:00    104              21
2 2022-03-09 09:51:00    105               3
3 2022-03-09 09:54:00    110             nan
If you look at rows 1 and 2, the time difference is 21 minutes. For this reason, I don't want the values 104 and 105 to be connected - I want a break in the line if there is a gap of greater than 15 mins and 15 seconds.
So, I am trying to insert a new row with null/nan values in my dataframe if the time difference between rows is greater than 15 mins and 15 seconds, so that Plotly will not connect the gaps.
Desired output:
             datetime  value
0 2022-03-09 09:25:00     98
1 2022-03-09 09:30:00    104
2 2022-03-09 09:40:30    nan
3 2022-03-09 09:51:00    105
4 2022-03-09 09:54:00    110
I hope that makes sense. I know that inserting rows programmatically is probably not optimal, which may be why I haven't been able to find a good answer to this.
You can use a mask and pandas.concat:
df['datetime'] = pd.to_datetime(df['datetime'])

delta = pd.Timedelta('15 min 15 s')
d = df['datetime'].diff().shift(-1)

out = (pd.concat([df,
                  df['datetime'].add(d / 2)      # midpoint of the gap
                                .loc[d.gt(delta)]
                                .to_frame()])
         .sort_index()
       )
Output:
             datetime  value  time_diff_mins
0 2022-03-09 09:25:00   98.0             5.0
1 2022-03-09 09:30:00  104.0            21.0
1 2022-03-09 09:40:30    NaN             NaN
2 2022-03-09 09:51:00  105.0             3.0
3 2022-03-09 09:54:00  110.0             NaN
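With the inserted NaN row in place, a short sketch of the Plotly call the question is building toward (assuming plotly.graph_objects; connectgaps=False leaves a visible break at the NaN):

import plotly.graph_objects as go

fig = go.Figure(go.Scatter(x=out['datetime'], y=out['value'],
                           mode='lines', connectgaps=False))
fig.show()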

How to get 1 for 8 days after a date in pandas and 0 otherwise?

I have two dataframes:
daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01",end="2021-04-29")})
pc21 = pd.DataFrame({'Date': ["2021-01-21", "2021-03-11", "2021-04-22"]})
pc21['Date'] = pd.to_datetime(pc21['Date'])
What I want to do is the following: for every date in pc21 that is also in daily, I want to get, in a new column, values equal to 1 for the 8 days starting at that date and 0 otherwise.
This is an example of a desired output:
# 2021-01-21 is in both dataframes, so I want a new column in 'daily' that looks like this:
Date newcol
.
.
.
2021-01-20 0
2021-01-21 1
2021-01-22 1
2021-01-23 1
2021-01-24 1
2021-01-25 1
2021-01-26 1
2021-01-27 1
2021-01-28 1
2021-01-29 0
.
.
.
Can anyone help me achieve this?
Thanks!
You can try the following approach:
res = (daily
       .merge(pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
                         for d in pc21["Date"]]),
              how="left", indicator=True)
       .replace({"both": 1, "left_only": 0})
       .rename(columns={"_merge": "newcol"}))
Result:
In [15]: res
Out[15]:
Date newcol
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 0
4 2021-01-05 0
.. ... ...
114 2021-04-25 1
115 2021-04-26 1
116 2021-04-27 1
117 2021-04-28 1
118 2021-04-29 1
[119 rows x 2 columns]
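One caveat worth hedging: indicator=True produces a categorical _merge column, and replace on categoricals has been finicky across pandas versions. If it complains, mapping the column explicitly works too (a sketch, same names as above):

expanded = pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
                      for d in pc21["Date"]])
res = daily.merge(expanded, how="left", indicator=True)
res["newcol"] = res.pop("_merge").map({"both": 1, "left_only": 0}).astype(int)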
Another option is a helper column plus a left merge:
daily['value'] = 0
pc21['value'] = 1

daily = (pd.merge(daily, pc21, on='Date', how='left')
           .rename(columns={'value_y': 'value'})
           .drop(columns='value_x')
           .ffill(limit=7)
           .fillna(0))

pc21 = pc21.drop(columns='value')  # undo the helper column
Output subset:
daily.query('value == 1')
Date value
20 2021-01-21 1.0
21 2021-01-22 1.0
22 2021-01-23 1.0
23 2021-01-24 1.0
24 2021-01-25 1.0
25 2021-01-26 1.0
26 2021-01-27 1.0
27 2021-01-28 1.0
69 2021-03-11 1.0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].fillna(method="ffill", limit=7).fillna(0)
We generate the new column first:
- if the Date of daily is in the Date of pc21, put 1;
- otherwise, put a NaN.
Then forward-fill that column with a limit of 7, so that we get 8 consecutive 1s.
Lastly, fill the remaining NaNs with 0.
(You can put an astype(int) at the end to get integers.)
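As a self-contained check of that recipe (imports assumed; .ffill is the modern spelling of fillna(method='ffill'); 3 dates x 8 days should give 24 ones):

import numpy as np
import pandas as pd

daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01", end="2021-04-29")})
pc21 = pd.DataFrame({'Date': pd.to_datetime(["2021-01-21", "2021-03-11", "2021-04-22"])})

daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].ffill(limit=7).fillna(0).astype(int)
assert daily["new_col"].sum() == 24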

Split rows in a column and plot graph for a dataframe. Python

My data set contains data by day and hour:
time slot            hr_slot  location_point
2019-01-21 00:00:00        0              34
2019-01-21 01:00:00        1             564
2019-01-21 02:00:00        2             448
2019-01-21 03:00:00        3              46
...
2019-01-21 23:00:00       23              78
2019-01-22 00:00:00        0              34
2019-01-22 01:00:00        1             165
2019-01-22 02:00:00        2              65
2019-01-22 03:00:00        3             156
...
2019-01-22 23:00:00       23              78
The data set contains 7 days, that is, 7*24 rows. How do I plot a graph for the dataset above, with:
hr_slot on the X axis: (0-23 hours)
location_point on the Y axis: (location_point)
and each day in a different color on the graph: (Day1: color1, Day2: color2, ...)
Consider pivoting your data first:
# Create normalized date column
df['date'] = df['time slot'].dt.date.astype(str)
# Pivot
piv = df.pivot(index='hr_slot', columns='date', values='location_point')
piv.plot()
Update
To filter which dates are plotted, using loc or iloc:
# Exclude first and last day
piv.iloc[:, 1:-1].plot()
# Include specific dates only
piv.loc[:, ['2019-01-21', '2019-01-22']].plot()
Alternate approach using pandas.crosstab instead:
(pd.crosstab(df['hr_slot'],
             df['time slot'].dt.date,
             values=df['location_point'],
             aggfunc='sum')
   .plot())
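To get the axis labels and legend the question asks for, a small follow-up sketch using standard pandas/matplotlib calls:

ax = piv.plot(marker='o')
ax.set_xlabel('hr_slot (0-23)')
ax.set_ylabel('location_point')
ax.legend(title='date')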

best way to fill up gaps by yearly dates in Python dataframe

All, I'm a newbie to Python and am stuck on the problem below. I have a DF as:
ipdb> DF
    asofdate  port_id
1 2010-01-01       76
2 2010-04-01       43
3 2011-02-01       76
4 2013-01-02       93
5 2017-02-01       43
For the yearly gaps, say 2012, 2014, 2015, and 2016, I'd like to fill in each gap using the New Year's date of the missing year and the port_id from the previous row. Ideally, I'd like:
ipdb> DF
    asofdate  port_id
1 2010-01-01       76
2 2010-04-01       43
3 2011-02-01       76
4 2012-01-01       76
5 2013-01-02       93
6 2014-01-01       93
7 2015-01-01       93
8 2016-01-01       93
9 2017-02-01       43
I tried multiple approaches but to no avail. Could some expert shed some light on how to make it work? Thanks much in advance!
You can use set.difference with range to find the missing years and then concatenate a dataframe (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
# convert to datetime if not already converted
df['asofdate'] = pd.to_datetime(df['asofdate'])

# calculate missing years
years = df['asofdate'].dt.year
missing = set(range(years.min(), years.max())) - set(years)

# concatenate the year-start rows, sort and forward-fill
df = (pd.concat([df, pd.DataFrame({'asofdate': pd.to_datetime(list(missing), format='%Y')})])
        .sort_values('asofdate')
        .ffill())
print(df)
asofdate port_id
1 2010-01-01 76.0
2 2010-04-01 43.0
3 2011-02-01 76.0
1 2012-01-01 76.0
4 2013-01-02 93.0
2 2014-01-01 93.0
3 2015-01-01 93.0
0 2016-01-01 93.0
5 2017-02-01 43.0
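Note the duplicated index labels in the output: the concatenated helper frame keeps its own index. If that matters, a one-line cleanup (a sketch):

df = df.reset_index(drop=True)  # rebuild a clean 0..n-1 RangeIndex after the concat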
I would create a helper dataframe, containing all the year start dates, then filter out the ones where the years match what is in df, and finally merge them together:
# First make sure it is proper datetime
df['asofdate'] = pd.to_datetime(df.asofdate)
# Create your temporary dataframe of year start dates
helper = pd.DataFrame({'asofdate':pd.date_range(df.asofdate.min(), df.asofdate.max(), freq='YS')})
# Filter out the rows where the year is already in df
helper = helper[~helper.asofdate.dt.year.isin(df.asofdate.dt.year)]
# Merge back in to df, sort, and forward fill
new_df = df.merge(helper, how='outer').sort_values('asofdate').ffill()
>>> new_df
asofdate port_id
0 2010-01-01 76.0
1 2010-04-01 43.0
2 2011-02-01 76.0
5 2012-01-01 76.0
3 2013-01-02 93.0
6 2014-01-01 93.0
7 2015-01-01 93.0
8 2016-01-01 93.0
4 2017-02-01 43.0

Pandas: Counting frequency of datetime objects in a column

I have a column (from my original data) that I have converted from a string to a datetime-object in Pandas.
The column looks like this:
0 2012-01-15 11:10:12
1 2012-01-15 11:15:01
2 2012-01-16 11:15:12
3 2012-01-16 11:25:01
...
4 2012-01-22 11:25:11
5 2012-01-22 11:40:01
6 2012-01-22 11:40:18
7 2012-01-23 11:40:23
8 2012-01-23 11:40:23
...
9 2012-01-30 11:50:02
10 2012-01-30 11:50:41
11 2012-01-30 12:00:01
12 2012-01-30 12:00:34
13 2012-01-30 12:45:01
...
14 2012-02-05 12:45:13
15 2012-01-05 12:55:01
15 2012-01-05 12:55:01
16 2012-02-05 12:56:11
17 2012-02-05 13:10:01
...
18 2012-02-11 13:10:11
...
19 2012-02-20 13:25:02
20 2012-02-20 13:26:14
21 2012-02-20 13:30:01
...
22 2012-02-25 13:30:08
23 2012-02-25 13:30:08
24 2012-02-25 13:30:08
25 2012-02-26 13:30:08
26 2012-02-27 13:30:08
27 2012-02-27 13:30:08
28 2012-02-27 13:30:25
29 2012-02-27 13:30:25
What I would like to do is to count the frequency of each date occurring. As you can see, I have left some dates out, but if I were to compute the frequency manually (for visible values), I would have:
2012-01-15 - 2 (frequency)
2012-01-16 - 2
2012-01-22 - 3
2012-01-23 - 2
2012-01-30 - 5
2012-02-05 - 5
2012-02-11 - 1
2012-02-20 - 3
2012-02-25 - 3
2012-02-26 - 1
2012-02-27 - 4
This is the daily frequency and I would like to count it. I have so far tried this:
df[df.str.contains(r'^\d\d\d\d-\d\d-\d\d')].value_counts()
I know it fails because these are not 'string' objects, but I am not sure how else to count this.
I have also looked at the .dt property, but the Pandas documentation is very verbose on these simple frequency calculations.
Also, to generalize this, how would I:
Apply the daily frequency to weekly frequency (e.g. Monday to Sunday)
Apply daily frequency to monthly frequency (e.g. how many times I see "2012-01-**" in my column)
Use the daily/weekly/monthly restrictions across other columns (e.g. if I have a column that contains "GET requests", I would like to know how many occurred daily, then weekly and then monthly)
Apply a weekly restriction with another restriction (e.g. if I have a column that returns "404 Not Found", I would like to check how many "404 Not Found" responses I received per week)
Perhaps the solution is a long one, where I may need to do lots of split-apply-combine ... but I was led to believe that Pandas simplifies/abstracts away a lot of the work, which is why I am stuck now.
The source of this file could be considered something equivalent to a server-log file.
You can first get the date part of the datetime, and then use value_counts:
s.dt.date.value_counts()
Small example:
In [12]: s = pd.Series(pd.date_range('2012-01-01', freq='11H', periods=6))
In [13]: s
Out[13]:
0 2012-01-01 00:00:00
1 2012-01-01 11:00:00
2 2012-01-01 22:00:00
3 2012-01-02 09:00:00
4 2012-01-02 20:00:00
5 2012-01-03 07:00:00
dtype: datetime64[ns]
In [14]: s.dt.date
Out[14]:
0 2012-01-01
1 2012-01-01
2 2012-01-01
3 2012-01-02
4 2012-01-02
5 2012-01-03
dtype: object
In [15]: s.dt.date.value_counts()
Out[15]:
2012-01-01 3
2012-01-02 2
2012-01-03 1
dtype: int64
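For the weekly/monthly generalizations in the question, dt.to_period labels each timestamp by its containing period, so the same value_counts pattern carries over. A sketch (the status and time column names in the last two lines are hypothetical):

s.dt.to_period('W').value_counts().sort_index()   # counts per week (weeks end Sunday by default)
s.dt.to_period('M').value_counts().sort_index()   # counts per month

# restricted to another column, e.g. "404 Not Found" responses per week
sub = df.loc[df['status'].eq('404 Not Found')]
sub.groupby(sub['time'].dt.to_period('W')).size()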
Late to the party, but nowadays it is dataframe.resample('1D', on='date_time_column').count() — note that resample operates on the index unless you pass the datetime column via on=.
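Equivalently, as a sketch (date_time_column stands in for whatever the datetime column is called):

# resample works on the index, so make the datetime column the index first;
# .size() counts rows per day, while .count() counts non-null values per column
df.set_index('date_time_column').resample('1D').size()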
You can try this:
df.groupby(level=0).count()
This requires your date to be the index.
