Pandas: Counting frequency of datetime objects in a column - python

I have a column (from my original data) that I have converted from a string to a datetime-object in Pandas.
The column looks like this:
0 2012-01-15 11:10:12
1 2012-01-15 11:15:01
2 2012-01-16 11:15:12
3 2012-01-16 11:25:01
...
4 2012-01-22 11:25:11
5 2012-01-22 11:40:01
6 2012-01-22 11:40:18
7 2012-01-23 11:40:23
8 2012-01-23 11:40:23
...
9 2012-01-30 11:50:02
10 2012-01-30 11:50:41
11 2012-01-30 12:00:01
12 2012-01-30 12:00:34
13 2012-01-30 12:45:01
...
14 2012-02-05 12:45:13
15 2012-02-05 12:55:01
15 2012-02-05 12:55:01
16 2012-02-05 12:56:11
17 2012-02-05 13:10:01
...
18 2012-02-11 13:10:11
...
19 2012-02-20 13:25:02
20 2012-02-20 13:26:14
21 2012-02-20 13:30:01
...
22 2012-02-25 13:30:08
23 2012-02-25 13:30:08
24 2012-02-25 13:30:08
25 2012-02-26 13:30:08
26 2012-02-27 13:30:08
27 2012-02-27 13:30:08
28 2012-02-27 13:30:25
29 2012-02-27 13:30:25
What I would like to do is count how often each date occurs. As you can see, I have left some dates out, but if I were to compute the frequency manually for the visible values, I would have:
2012-01-15 - 2 (frequency)
2012-01-16 - 2
2012-01-22 - 3
2012-01-23 - 2
2012-01-30 - 5
2012-02-05 - 5
2012-02-11 - 1
2012-02-20 - 3
2012-02-25 - 3
2012-02-26 - 1
2012-02-27 - 4
This is the daily frequency I would like to compute. So far I have tried this:
df[df.str.contains(r'^\d\d\d\d-\d\d-\d\d')].value_counts()
I know it fails because these are not 'string' objects, but I am not sure how else to count this.
I have also looked at the .dt accessor, but the Pandas documentation is very verbose about what should be a simple frequency calculation.
Also, to generalize this, how would I:
Aggregate the daily frequency into a weekly frequency (e.g. Monday to Sunday)
Aggregate the daily frequency into a monthly frequency (e.g. how many times "2012-01-**" appears in my column)
Apply the daily/weekly/monthly restrictions to other columns (e.g. if I have a column that contains "GET requests", I would like to know how many occurred daily, then weekly and then monthly)
Combine a weekly restriction with another restriction (e.g. if I have a column that contains "404 Not Found", I would like to check how many "404 Not Found" responses I received per week)
Perhaps the solution is a long one, where I may need to do a lot of split-apply-combine, but I was led to believe that Pandas simplifies/abstracts away a lot of this work, which is why I am stuck now.
The source of this file could be considered something equivalent to a server-log file.

You can first get the date part of the datetime, and then use value_counts:
s.dt.date.value_counts()
Small example:
In [12]: s = pd.Series(pd.date_range('2012-01-01', freq='11H', periods=6))
In [13]: s
Out[13]:
0 2012-01-01 00:00:00
1 2012-01-01 11:00:00
2 2012-01-01 22:00:00
3 2012-01-02 09:00:00
4 2012-01-02 20:00:00
5 2012-01-03 07:00:00
dtype: datetime64[ns]
In [14]: s.dt.date
Out[14]:
0 2012-01-01
1 2012-01-01
2 2012-01-01
3 2012-01-02
4 2012-01-02
5 2012-01-03
dtype: object
In [15]: s.dt.date.value_counts()
Out[15]:
2012-01-01 3
2012-01-02 2
2012-01-03 1
dtype: int64
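For the weekly/monthly generalisations asked about above, a minimal sketch along the same lines using dt.to_period (the logs frame and its column names below are made up purely for illustration, they are not from the question):
import pandas as pd

# Re-using the Series from the example above.
s = pd.Series(pd.date_range('2012-01-01', freq='11H', periods=6))

# Weekly counts (weeks are labelled Monday..Sunday by the default W-SUN).
print(s.dt.to_period('W').value_counts().sort_index())

# Monthly counts, e.g. how often "2012-01-**" occurs.
print(s.dt.to_period('M').value_counts().sort_index())

# The same idea extends to a second column. 'time' and 'status' are
# hypothetical names used only for this sketch.
logs = pd.DataFrame({
    'time': s,
    'status': ['200 OK', '404 Not Found', '200 OK',
               '404 Not Found', '404 Not Found', '200 OK'],
})
is_404 = logs['status'] == '404 Not Found'
print(is_404.groupby(logs['time'].dt.to_period('W')).sum())  # 404s per week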

Late to the party, but nowadays you can use resampling: dataframe.date_time_column.resample('1D').count()
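Note that resample needs the datetimes on the index (or pointed at via on=); a minimal sketch of the two usual variants, with a hypothetical frame and column name:
import pandas as pd

df = pd.DataFrame({'date_time_column': pd.to_datetime(
    ['2012-01-15 11:10:12', '2012-01-15 11:15:01', '2012-01-16 11:15:12'])})

# Either make the datetimes the index first ...
daily = df.set_index('date_time_column').resample('1D').size()

# ... or keep them as a column and tell resample which column to use.
daily_alt = df.resample('1D', on='date_time_column').size()
print(daily)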

You can try this:
df.groupby(level=0).count()
This requires your dates to be the index.
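A sketch of what that looks like end to end, assuming the datetimes start out in a column named 'time' (a made-up name): normalise them to midnight and set them as the index first.
import pandas as pd

df = pd.DataFrame({'time': pd.to_datetime(
    ['2012-01-15 11:10:12', '2012-01-15 11:15:01', '2012-01-16 11:15:12'])})

daily = (df.set_index(df['time'].dt.normalize())  # index = midnight of each day
           .groupby(level=0)
           .size())
print(daily)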

Related

Create custom sized bins of datetime Series in Pandas

I have multiple Pandas Series of datetime64 values that I want to bin into groups using arbitrary bin sizes.
I've found the Series.to_period() function which does exactly what I want except that I need more control over the chosen bin size. to_period allows me to bin by full years, months, days, etc. but I also want to bin by 5 years, 6 hours or 15 minutes. Using a syntax like 5Y, 6H or 15min works in other corners of Pandas but apparently not here.
s = pd.Series(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"], dtype="datetime64[ns]")
# Output as expected
s.dt.to_period("M").value_counts()
2020-02 4
Freq: M, dtype: int64
# Output as expected
s.dt.to_period("W").value_counts()
2020-01-27/2020-02-02 2
2020-02-03/2020-02-09 2
Freq: W-SUN, dtype: int64
# Output as expected
s.dt.to_period("D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: D, dtype: int64
# Output unexpected (and wrong?)
s.dt.to_period("2D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: 2D, dtype: int64
I believe that pd.Grouper is what you're looking for.
https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html
It provides the flexibility of using arbitrary multiples of the standard frequencies listed in the offset aliases: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
From the documentation:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min')).sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64
NOTE:
If you'd like to .groupby a certain column then use the following syntax:
df.groupby(pd.Grouper(key="my_col", freq="3M"))
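Applied to the Series from the question, a small sketch of how pd.Grouper gives the 2-day bins that to_period would not (the column name "ts" is just made up here):
import pandas as pd

s = pd.Series(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"],
              dtype="datetime64[ns]")

# Wrap the Series in a frame so pd.Grouper can pick the datetimes up by key.
counts_2d = (s.to_frame("ts")
               .groupby(pd.Grouper(key="ts", freq="2D"))
               .size())
print(counts_2d)
# Should give two 2-day bins with 2 values each, e.g.
# 2020-02-01    2
# 2020-02-03    2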

Resample pandas dataframe by two columns

I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five-minute blocks showing the number of arrivals at each station in the time block that starts at that time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
df.groupby(['id',pd.Grouper(key='time', freq='5min')])\
.size()\
.to_frame('arrivals')\
.reset_index()
I think it's a horrible solution (I couldn't find a better one at the moment), but it more or less gets you where you want to be:
df.groupby("id").resample("5min", on="time").count()[["id"]]\
    .swaplevel(0, 1, axis=0)\
    .sort_index(axis=0)\
    .set_axis(["arrivals"], axis=1)
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1
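For reference, a self-contained sketch of the groupby + Grouper answer on data shaped like the question's sample. Note that 5-minute blocks with no arrivals simply do not appear in the result; explicit zero rows would need an extra unstack/reindex step.
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2019-10-31 23:59:36', '2019-10-31 23:58:23',
                            '2019-10-31 23:54:55', '2019-10-31 23:54:46',
                            '2019-10-31 23:54:42']),
    'id': [22, 260, 82, 82, 21],
})

arrivals = (df.groupby(['id', pd.Grouper(key='time', freq='5min')])
              .size()
              .to_frame('arrivals')
              .reset_index())
print(arrivals)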

Merging two dataframes and removing duplicate rows WITH duplicate indexes (pandas)

I have read miscellaneous posts with a similar question but couldn't find exactly this question.
I have two pandas DataFrames that I want to merge.
They have timestamps as indexes.
The 2nd DataFrame basically overlaps the 1st, and they thus share rows with the same timestamps and values.
I would like to remove these rows because they share everything: index and values in columns.
If they don't share both index and values in columns, I want to keep them.
So far, I could point out:
Index.drop_duplicates: this is not what I am looking for. It doesn't check that the values in the columns are the same, and I want to keep rows with the same timestamp but different values in the columns.
DataFrame.drop_duplicates: same as above, it doesn't check the index value, and if rows are found with the same values in the columns but different indexes, I want to keep them.
To give an example, I am re-using the data given in the answer below.
df1
Value
2012-02-01 12:00:00 10
2012-02-01 12:30:00 10
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
df2
Value
2012-02-01 12:30:00 20
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
2012-02-02 14:00:00 10
Result I would like to obtain is the following one:
Value
2012-02-01 12:00:00 10 #(from df1)
2012-02-01 12:30:00 10 #(from df1)
2012-02-01 12:30:00 20 #(from df2 - same index than in df1, but different value)
2012-02-01 13:00:00 20 #(in df1 & df2, only one kept)
2012-02-01 13:30:00 30 #(in df1 & df2, only one kept)
2012-02-02 14:00:00 10 #(from df2)
Please, any idea?
Thanks for your help!
Bests
Assume that you have the two following DataFrames:
df:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 13:00:00 20
3 2012-02-01 13:30:00 30
4 2012-02-02 14:00:00 10
5 2012-02-02 14:30:00 10
6 2012-02-02 15:00:00 20
7 2012-02-02 15:30:00 30
df2:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 21
2 2012-02-01 12:40:00 22
3 2012-02-01 13:00:00 20
4 2012-02-01 13:30:00 30
To generate the result, run:
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates().reset_index(drop=True)
The result, for the above data, is:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 12:30:00 21
3 2012-02-01 12:40:00 22
4 2012-02-01 13:00:00 20
5 2012-02-01 13:30:00 30
6 2012-02-02 14:00:00 10
7 2012-02-02 14:30:00 10
8 2012-02-02 15:00:00 20
9 2012-02-02 15:30:00 30
drop_duplicates drops duplicated rows, keeping the first.
Since no subset parameter has been passed, the criterion to treat
2 rows as duplicates is identity of all columns.
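If, as in the question itself, the timestamps are the index rather than a column, one sketch is to move the index into a column first so drop_duplicates can take it into account; this should reproduce the result asked for:
import pandas as pd

df1 = pd.DataFrame({'Value': [10, 10, 20, 30]},
                   index=pd.to_datetime(['2012-02-01 12:00', '2012-02-01 12:30',
                                         '2012-02-01 13:00', '2012-02-01 13:30']))
df2 = pd.DataFrame({'Value': [20, 20, 30, 10]},
                   index=pd.to_datetime(['2012-02-01 12:30', '2012-02-01 13:00',
                                         '2012-02-01 13:30', '2012-02-02 14:00']))

merged = (pd.concat([df1, df2])
            .reset_index()        # the unnamed timestamp index becomes column 'index'
            .drop_duplicates()    # duplicates = same timestamp AND same Value
            .set_index('index')
            .sort_index())
print(merged)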
Just improving the first answer, insert Date inside drop_duplicates
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)

How to define if-else function using dataframe columns as arguments in python?

I need to write a function and then apply it for a dataframe's column in pandas.
My dataframe looks like this. Data is sorted by the id and then the period column.
period id column1
0 2013-01-31 5 NaT
1 2013-02-28 5 28 days
2 2013-03-31 5 31 days
3 2013-04-30 5 30 days
4 2016-05-31 6 NaT
5 2016-06-30 6 30 days
6 2016-08-31 6 62 days
The new column's values should be defined according to the values in column1:
if column1 is NaT or column1 > 31 days,
then the new column equals the value in the period column;
else the value of the new column should be copied from the previous row:
newcolumn[i] = newcolumn[i-1].
I am very new to python and my code doesn't work:
def f(x):
if not x or x > 31
return x=df['period']
else
return x=x.shift()
df['newcolumn'] = df['column1'].apply(f)
The output should be this:
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
Any help would be much appreciated.
First it might be necessary to convert period to datetime using pd.to_datetime:
df['period'] = pd.to_datetime(df['period'])
Then you can use Series.where together with ffill:
df['newcolumn'] = df['period'].where((df['column1'] > pd.Timedelta('31 days')) | (df['column1'].isnull())).ffill()
print(df)
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
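For reference, a self-contained sketch that builds the example frame and applies the same where + ffill idea; constructing column1 with a grouped diff is just an assumption made here so the snippet runs on its own:
import pandas as pd

df = pd.DataFrame({
    'period': pd.to_datetime(['2013-01-31', '2013-02-28', '2013-03-31',
                              '2013-04-30', '2016-05-31', '2016-06-30',
                              '2016-08-31']),
    'id': [5, 5, 5, 5, 6, 6, 6],
})
# column1 = days since the previous period within each id (NaT for each id's first row).
df['column1'] = df.groupby('id')['period'].diff()

# Keep period where column1 is NaT or > 31 days, otherwise carry the last kept date forward.
mask = df['column1'].isnull() | (df['column1'] > pd.Timedelta('31 days'))
df['newcolumn'] = df['period'].where(mask).ffill()
print(df)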
You can use Series.where(cond, other), which keeps the value where the condition holds and returns other elsewhere:
df["newcolumn"] = df["period"].where(df["column1"].isnull() | (df["column1"] > pd.Timedelta("31D")), df["column1"].shift())

python pandas series loc value from multi index

I have a series that looks like this
2014 7 2014-07-01 -0.045417
8 2014-08-01 -0.035876
9 2014-09-02 -0.030971
10 2014-10-01 -0.027471
11 2014-11-03 -0.032968
12 2014-12-01 -0.031110
2015 1 2015-01-02 -0.028906
2 2015-02-02 -0.035563
3 2015-03-02 -0.040338
4 2015-04-01 -0.032770
5 2015-05-01 -0.025762
6 2015-06-01 -0.019746
7 2015-07-01 -0.018541
8 2015-08-03 -0.028101
9 2015-09-01 -0.043237
10 2015-10-01 -0.053565
11 2015-11-02 -0.062630
12 2015-12-01 -0.064618
2016 1 2016-01-04 -0.064852
I want to be able to get the value for a given date, something like:
myseries.loc['2015-10-01'], which should return -0.053565.
The index entries are tuples of the form (2016, 1, 2016-01-04).
You can do it like this:
In [32]:
df.loc(axis=0)[:,:,'2015-10-01']
Out[32]:
value
year month date
2015 10 2015-10-01 -0.053565
You can also pass slice for each level:
In [39]:
df.loc[(slice(None),slice(None),'2015-10-01'),]
Out[39]:
value
year month date
2015 10 2015-10-01 -0.053565
Or just pass the first 2 index levels:
In [40]:
df.loc[2015,10]
Out[40]:
value
date
2015-10-01 -0.053565
Try xs:
print(s.xs('2015-10-01', level=2, axis=0))
#year datetime
#2015 10 -0.053565
#Name: series, dtype: float64
print(s.xs(7, level=1, axis=0))
#year datetime
#2014 2014-07-01 -0.045417
#2015 2015-07-01 -0.018541
#Name: series, dtype: float64
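Another option, sketched here under the assumption that the three levels are named year, month and date as in the output above: look the value up with the full index tuple, or drop the outer levels first and index by the date alone.
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(2015, 10, pd.Timestamp('2015-10-01')),
     (2015, 11, pd.Timestamp('2015-11-02'))],
    names=['year', 'month', 'date'])
s = pd.Series([-0.053565, -0.062630], index=idx)

# Exact lookup with the full index tuple returns the scalar directly.
print(s.loc[(2015, 10, pd.Timestamp('2015-10-01'))])   # -0.053565

# Or drop the first two levels and look the value up by date.
print(s.droplevel(['year', 'month']).loc['2015-10-01'])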
