I have a column (from my original data) that I have converted from a string to a datetime-object in Pandas.
The column looks like this:
0 2012-01-15 11:10:12
1 2012-01-15 11:15:01
2 2012-01-16 11:15:12
3 2012-01-16 11:25:01
...
4 2012-01-22 11:25:11
5 2012-01-22 11:40:01
6 2012-01-22 11:40:18
7 2012-01-23 11:40:23
8 2012-01-23 11:40:23
...
9 2012-01-30 11:50:02
10 2012-01-30 11:50:41
11 2012-01-30 12:00:01
12 2012-01-30 12:00:34
13 2012-01-30 12:45:01
...
14 2012-02-05 12:45:13
15 2012-02-05 12:55:01
15 2012-02-05 12:55:01
16 2012-02-05 12:56:11
17 2012-02-05 13:10:01
...
18 2012-02-11 13:10:11
...
19 2012-02-20 13:25:02
20 2012-02-20 13:26:14
21 2012-02-20 13:30:01
...
22 2012-02-25 13:30:08
23 2012-02-25 13:30:08
24 2012-02-25 13:30:08
25 2012-02-26 13:30:08
26 2012-02-27 13:30:08
27 2012-02-27 13:30:08
28 2012-02-27 13:30:25
29 2012-02-27 13:30:25
What I would like to do is count how often each date occurs. As you can see, I have left some dates out, but if I were to compute the frequencies manually (for the visible values), I would have:
2012-01-15 - 2 (frequency)
2012-01-16 - 2
2012-01-22 - 3
2012-01-23 - 2
2012-01-30 - 5
2012-02-05 - 5
2012-02-11 - 1
2012-02-20 - 3
2012-02-25 - 3
2012-02-26 - 1
2012-02-27 - 4
This is the daily frequency, and it is what I would like to compute. So far I have tried this:
df[df.str.contains(r'^\d\d\d\d-\d\d-\d\d')].value_counts()
I know it fails because these are not 'string' objects, but I am not sure how else to count this.
I have also looked at the .dt property, but the Pandas documentation is very verbose on these simple frequency calculations.
Also, to generalize this, how would I:
Aggregate the daily frequency into a weekly frequency (e.g. Monday to Sunday)
Aggregate the daily frequency into a monthly frequency (e.g. how many times "2012-01-**" appears in my column)
Apply the daily/weekly/monthly restrictions to other columns (e.g. if I have a column that contains "GET requests", how many occurred daily, then weekly, then monthly)
Combine a weekly restriction with another restriction (e.g. if I have a column that contains "404 Not Found", how many "404 Not Found" responses did I receive per week?)
Perhaps the solution is a long one and I need to do a lot of split-apply-combine, but I was led to believe that Pandas simplifies/abstracts away a lot of this work, which is why I am stuck now.
The source of this file could be considered something equivalent to a server-log file.
You can first get the date part of the datetime, and then use value_counts:
s.dt.date.value_counts()
Small example:
In [12]: s = pd.Series(pd.date_range('2012-01-01', freq='11H', periods=6))
In [13]: s
Out[13]:
0 2012-01-01 00:00:00
1 2012-01-01 11:00:00
2 2012-01-01 22:00:00
3 2012-01-02 09:00:00
4 2012-01-02 20:00:00
5 2012-01-03 07:00:00
dtype: datetime64[ns]
In [14]: s.dt.date
Out[14]:
0 2012-01-01
1 2012-01-01
2 2012-01-01
3 2012-01-02
4 2012-01-02
5 2012-01-03
dtype: object
In [15]: s.dt.date.value_counts()
Out[15]:
2012-01-01 3
2012-01-02 2
2012-01-03 1
dtype: int64
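For the weekly/monthly variants and the extra-column restrictions asked about in the question, a minimal sketch along the same lines (the 'status' column and its values are made up for illustration and are not part of the original data):

import pandas as pd

# Toy frame: 'time' stands in for the question's datetime column,
# 'status' is a hypothetical request-status column.
df = pd.DataFrame({
    'time': pd.to_datetime(['2012-01-15 11:10:12', '2012-01-16 11:15:12',
                            '2012-02-05 12:45:13', '2012-02-05 12:55:01']),
    'status': ['200 OK', '404 Not Found', '404 Not Found', '200 OK'],
})

daily   = df['time'].dt.date.value_counts().sort_index()            # per day
weekly  = df['time'].dt.to_period('W').value_counts().sort_index()  # Monday-Sunday weeks (W-SUN)
monthly = df['time'].dt.to_period('M').value_counts().sort_index()  # e.g. everything in "2012-01"

# Restriction on another column, e.g. "404 Not Found" responses per week:
errors = df[df['status'] == '404 Not Found']
per_week_404 = errors.groupby(errors['time'].dt.to_period('W')).size()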
Late to the party, but nowadays it is dataframe.resample('1D', on='date_time_column').count() (resample needs the datetimes on the index, or passed via on= as here).
You can try this:
df.groupby(level=0).count()
This requires your date to be the index.
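A small self-contained sketch of that approach ('date' is a hypothetical column name; normalize() drops the time-of-day so grouping on the index groups whole days together):

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2012-01-15 11:10:12',
                                           '2012-01-15 11:15:01',
                                           '2012-01-16 11:15:12'])})

# Put the normalized dates on the index, then count rows per index value.
daily = df.set_index(df['date'].dt.normalize()).groupby(level=0).size()
print(daily)
# 2012-01-15    2
# 2012-01-16    1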
I have multiple Pandas Series of datetime64 values that I want to bin into groups using arbitrary bin sizes.
I've found the Series.to_period() function, which does exactly what I want, except that I need more control over the chosen bin size. to_period allows me to bin by full years, months, days, etc., but I also want to bin by 5 years, 6 hours or 15 minutes. A syntax like 5Y, 6H or 15min works in other corners of Pandas, but apparently not here.
s = pd.Series(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"], dtype="datetime64[ns]")
# Output as expected
s.dt.to_period("M").value_counts()
2020-02 4
Freq: M, dtype: int64
# Output as expected
s.dt.to_period("W").value_counts()
2020-01-27/2020-02-02 2
2020-02-03/2020-02-09 2
Freq: W-SUN, dtype: int64
# Output as expected
s.dt.to_period("D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: D, dtype: int64
# Output unexpected (and wrong?)
s.dt.to_period("2D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: 2D, dtype: int64
I believe that pd.Grouper is what you're looking for.
https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html
It provides the flexibility of using multiples of the standard frequencies (offset aliases): https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
From the documentation:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min')).sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64
NOTE:
If you'd like to group by a certain column, use the following syntax:
df.groupby(pd.Grouper(key="my_col", freq="3M"))
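Applied to the 2-day binning from the question, a small sketch: pd.Grouper groups on the index, so the datetimes are first moved onto the index of a helper Series.

import pandas as pd

s = pd.Series(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"],
              dtype="datetime64[ns]")

# Index a throwaway series of ones by the datetimes so the Grouper can bin them.
counts = pd.Series(1, index=pd.DatetimeIndex(s)).groupby(pd.Grouper(freq="2D")).count()
print(counts)
# Expected:
# 2020-02-01    2
# 2020-02-03    2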
I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five-minute blocks showing the number of arrivals at each station in the time block that starts at that time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
# Count rows per (id, 5-minute bin) and name the count 'arrivals'.
(df.groupby(['id', pd.Grouper(key='time', freq='5min')])
   .size()
   .to_frame('arrivals')
   .reset_index())
I think it's a horrible solution (I couldn't find a better one at the moment), but it more or less gets you where you want to be:
(df.groupby("id")
   .resample("5min", on="time")
   .count()[["id"]]
   .swaplevel(0, 1, axis=0)
   .sort_index(axis=0)
   .set_axis(["arrivals"], axis=1))
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1
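If the exact time / id / arrivals layout from the question is needed, a sketch building on the same groupby + resample idea (column names and sample rows taken from the question):

import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2019-10-31 23:59:36", "2019-10-31 23:58:23",
                            "2019-10-31 23:54:55", "2019-10-31 23:54:46",
                            "2019-10-31 23:54:42"]),
    "id": [22, 260, 82, 82, 21],
})

out = (df.set_index("time")
         .groupby("id")
         .resample("5min")
         .size()                  # rows (arrivals) per (id, 5-minute bin)
         .rename("arrivals")
         .reset_index())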
I have read miscellaneous posts with a similar question but couldn't find exactly this question.
I have two pandas DataFrames that I want to merge.
They have timestamps as indexes.
The 2nd DataFrame largely overlaps the 1st, so they share rows with the same timestamps and values.
I would like to remove these rows because they share everything: index and values in the columns.
If they don't share both the index and the column values, I want to keep them.
So far, I have found:
Index.drop_duplicates: this is not what I am looking for. It doesn't check whether the values in the columns are the same, and I want to keep rows with the same timestamp but different column values.
DataFrame.drop_duplicates: same as above, it doesn't check the index, and if rows are found with the same column values but different indexes, I want to keep them.
To give an example, I am re-using the data given in the answer below.
df1
Value
2012-02-01 12:00:00 10
2012-02-01 12:30:00 10
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
df2
Value
2012-02-01 12:30:00 20
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
2012-02-02 14:00:00 10
Result I would like to obtain is the following one:
Value
2012-02-01 12:00:00 10 #(from df1)
2012-02-01 12:30:00 10 #(from df1)
2012-02-01 12:30:00 20 #(from df2 - same index than in df1, but different value)
2012-02-01 13:00:00 20 #(in df1 & df2, only one kept)
2012-02-01 13:30:00 30 #(in df1 & df2, only one kept)
2012-02-02 14:00:00 10 #(from df2)
Any ideas? Thanks for your help!
Assume that you have 2 following DataFrames:
df:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 13:00:00 20
3 2012-02-01 13:30:00 30
4 2012-02-02 14:00:00 10
5 2012-02-02 14:30:00 10
6 2012-02-02 15:00:00 20
7 2012-02-02 15:30:00 30
df2:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 21
2 2012-02-01 12:40:00 22
3 2012-02-01 13:00:00 20
4 2012-02-01 13:30:00 30
To generate the result, run:
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates().reset_index(drop=True)
The result, for the above data, is:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 12:30:00 21
3 2012-02-01 12:40:00 22
4 2012-02-01 13:00:00 20
5 2012-02-01 13:30:00 30
6 2012-02-02 14:00:00 10
7 2012-02-02 14:30:00 10
8 2012-02-02 15:00:00 20
9 2012-02-02 15:30:00 30
drop_duplicates drops duplicated rows, keeping the first.
Since no subset parameter has been passed, the criterion to treat
2 rows as duplicates is identity of all columns.
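The same idea carries over to the index-based frames from the question: move the timestamp index into an ordinary column first, so that drop_duplicates compares index and values together. A sketch ('ts' is just a placeholder name for the unnamed index):

import pandas as pd

df1 = pd.DataFrame({"Value": [10, 10, 20, 30]},
                   index=pd.to_datetime(["2012-02-01 12:00:00", "2012-02-01 12:30:00",
                                         "2012-02-01 13:00:00", "2012-02-01 13:30:00"]))
df2 = pd.DataFrame({"Value": [20, 20, 30, 10]},
                   index=pd.to_datetime(["2012-02-01 12:30:00", "2012-02-01 13:00:00",
                                         "2012-02-01 13:30:00", "2012-02-02 14:00:00"]))

out = (pd.concat([df1, df2])
         .rename_axis("ts")      # name the timestamp index...
         .reset_index()          # ...and turn it into a regular column
         .drop_duplicates()      # duplicate only if both timestamp and Value match
         .sort_values(["ts", "Value"])
         .set_index("ts"))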
Just improving on the first answer: pass 'Date' to drop_duplicates so that rows are deduplicated on the Date column alone (keeping the first):
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)
I need to write a function and then apply it to a dataframe's column in pandas.
My dataframe looks like this. The data is sorted by the id and then the period columns.
period id column1
0 2013-01-31 5 NaT
1 2013-02-28 5 28 days
2 2013-03-31 5 31 days
3 2013-04-30 5 30 days
4 2016-05-31 6 NaT
5 2016-06-30 6 30 days
6 2016-08-31 6 62 days
The values of the new column should be defined according to the values in column1:
if column1 is NaT or column1 > 31 days,
then the new column equals the value in the period column.
Otherwise, the value of the new column should be copied from its previous row:
newcolumn[i] = newcolumn[i-1].
I am very new to python and my code doesn't work:
def f(x):
    if not x or x > 31
        return x=df['period']
    else
        return x=x.shift()
df['newcolumn'] = df['column1'].apply(f)
The output should be this:
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
Any help would be much appreciated.
First, it might be necessary to convert period to datetime using pd.to_datetime:
df['period'] = pd.to_datetime(df['period'])
Then you can use Series.where together with ffill:
df['newcolumn'] = df['period'].where((df['column1'] > pd.Timedelta('31 days')) | (df['column1'].isnull())).ffill()
print(df)
period id column1 newcolumn
0 2013-01-31 5 NaT 2013-01-31
1 2013-02-28 5 28 days 2013-01-31
2 2013-03-31 5 31 days 2013-01-31
3 2013-04-30 5 30 days 2013-01-31
4 2016-05-31 6 NaT 2016-05-31
5 2016-06-30 6 30 days 2016-05-31
6 2016-08-31 6 62 days 2016-08-31
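Broken into steps, the same where + ffill logic reads (a sketch over the question's df):

# Rows where the rule fires: column1 is NaT or longer than 31 days.
mask = df['column1'].isnull() | (df['column1'] > pd.Timedelta('31 days'))

df['newcolumn'] = df['period'].where(mask)  # keep period on those rows, NaT elsewhere
df['newcolumn'] = df['newcolumn'].ffill()   # copy the last kept date down to the rows below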
You can use df.where(cond, other), which returns the original value where the condition holds and other elsewhere:
df["newcolumn"] = df["period"].where(df["column1"].isnull() | (df["column1"] > pd.Timedelta("31D")), df["column1"].shift())
I have a series that looks like this
2014 7 2014-07-01 -0.045417
8 2014-08-01 -0.035876
9 2014-09-02 -0.030971
10 2014-10-01 -0.027471
11 2014-11-03 -0.032968
12 2014-12-01 -0.031110
2015 1 2015-01-02 -0.028906
2 2015-02-02 -0.035563
3 2015-03-02 -0.040338
4 2015-04-01 -0.032770
5 2015-05-01 -0.025762
6 2015-06-01 -0.019746
7 2015-07-01 -0.018541
8 2015-08-03 -0.028101
9 2015-09-01 -0.043237
10 2015-10-01 -0.053565
11 2015-11-02 -0.062630
12 2015-12-01 -0.064618
2016 1 2016-01-04 -0.064852
I want to be able to get the value for a given date, something like:
myseries.loc['2015-10-01'] and it returns -0.053565
The index is made up of tuples of the form (2016, 1, 2016-01-04).
You can do it like this:
In [32]:
df.loc(axis=0)[:,:,'2015-10-01']
Out[32]:
value
year month date
2015 10 2015-10-01 -0.053565
You can also pass a slice for each level:
In [39]:
df.loc[(slice(None),slice(None),'2015-10-01'),]
Out[39]:
value
year month date
2015 10 2015-10-01 -0.053565
Or just pass the first 2 index levels:
In [40]:
df.loc[2015,10]
Out[40]:
value
date
2015-10-01 -0.053565
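An equivalent, arguably more readable spelling uses pd.IndexSlice (a sketch assuming the same sorted three-level year/month/date index):

idx = pd.IndexSlice
df.loc[idx[:, :, '2015-10-01'], :]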
Try xs:
print s.xs('2015-10-01',level=2,axis=0)
#year datetime
#2015 10 -0.053565
#Name: series, dtype: float64
print s.xs(7,level=1,axis=0)
#year datetime
#2014 2014-07-01 -0.045417
#2015 2015-07-01 -0.018541
#Name: series, dtype: float64