How to build a bar plot grouped by day/hour interval? - python

I have this dataset:
name date
0 ramos-vinolas-sao-paulo-2017-final 2017-03-05 22:50:00
1 sao-paulo-2017-doubles-final-sa-dutra-silva 2017-03-05 19:29:00
2 querrey-acapulco-2017-trophy 2017-03-05 06:08:00
3 soares-murray-acapulco-2017-doubles-final 2017-03-05 02:48:00
4 cuevas-sao-paulo-2017-saturday 2017-03-04 21:54:00
5 dubai-2017-doubles-final-rojer-tecau2 2017-03-04 18:23:00
I'd like to build a bar plot with the number of news items per day/hour. Something like:
count date
4 2017-03-05
2 2017-03-04

I think you need dt.date with value_counts; then call plot.bar on the result:
#if necessary convert to datetime
df['date'] = pd.to_datetime(df.date)
print (df.date.dt.date.value_counts())
2017-03-05 4
2017-03-04 2
Name: date, dtype: int64
df.date.dt.date.value_counts().plot.bar()
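If you also want day/hour bins, as the title asks, a minimal sketch using dt.floor, which truncates each timestamp to the start of its hour (use 'H' instead of 'h' on older pandas):
counts = df['date'].dt.floor('h').value_counts().sort_index()
counts.plot.bar()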

A simple alternative is the pandas hist() method:
df["date"].hist()
Note that hist() bins the raw timestamps into equal-width intervals, so the bin edges generally will not line up with calendar days; for exact per-day counts, the value_counts approach above is more reliable.

Related

How to count total days in pandas dataframe

I have a df column with dates and hours / minutes:
0 2019-09-13 06:00:00
1 2019-09-13 06:05:00
2 2019-09-13 06:10:00
3 2019-09-13 06:15:00
4 2019-09-13 06:20:00
Name: Date, dtype: datetime64[ns]
I need to count how many days the dataframe contains.
I tried it like this:
sample_length = len(df.groupby(df['Date'].dt.date).first())
and
sample_length = len(df.groupby(df['Date'].dt.date))
But the number I get seems wrong. Do you know another method of counting the days?
Consider the sample dates:
sample = pd.date_range('2019-09-12 06:00:00', periods=50, freq='4h')
df = pd.DataFrame({'date': sample})
date
0 2019-09-12 06:00:00
1 2019-09-12 10:00:00
2 2019-09-12 14:00:00
3 2019-09-12 18:00:00
4 2019-09-12 22:00:00
5 2019-09-13 02:00:00
6 2019-09-13 06:00:00
...
47 2019-09-20 02:00:00
48 2019-09-20 06:00:00
49 2019-09-20 10:00:00
Use DataFrame.groupby to group the dataframe on df['date'].dt.date and apply the aggregate function GroupBy.size:
count = df.groupby(df['date'].dt.date).size()
# print(count)
date
2019-09-12 5
2019-09-13 6
2019-09-14 6
2019-09-15 6
2019-09-16 6
2019-09-17 6
2019-09-18 6
2019-09-19 6
2019-09-20 3
dtype: int64
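If the goal is simply the number of distinct days rather than the per-day counts, a shorter route over the same sample frame:
print(df['date'].dt.normalize().nunique())
9
dt.date.nunique() works equally well; both count only the dates that actually occur.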
I'm not completely sure what you want to do here. Do you want to count the number of unique days (Monday/Tuesday/...), monthly dates (1-31 ish), yearly dates (1-365), or unique dates (unique days since the dawn of time)?
From a pandas series, you can use {series}.value_counts() to get the number of entries for each unique value, or simply get all unique values with {series}.unique()
import pandas as pd

df = pd.DataFrame(pd.DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',
                                    '2015-01-24 10:11:04', '2015-03-26 16:23:53',
                                    '2017-04-01 00:38:21', '2015-03-19 03:47:54',
                                    '2015-12-30 07:32:32', '2015-11-10 20:39:36',
                                    '2015-06-24 05:48:09', '2015-03-19 16:05:19'],
                                   dtype='datetime64[ns]', freq=None),
                  columns=['date'])
days (Monday/Tuesday/...):
df.date.dt.dayofweek.value_counts()
monthly dates (1-31 ish)
df.date.dt.day.value_counts()
yearly dates (1-365)
df.date.dt.dayofyear.value_counts()
unique dates (unique days since the dawn of time)
df.date.dt.date.value_counts()
To get the number of unique entries from any of the above, simply add .shape[0]
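For example, to count distinct calendar dates with the last variant:
n_days = df.date.dt.date.value_counts().shape[0]
#equivalently
n_days = df.date.dt.date.nunique()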
To calculate the total number of unique dates in the given time series example, we can use:
print(len(pd.to_datetime(df['Date']).dt.date.unique()))
import pandas as pd

df = pd.DataFrame({'Date': ['2019-09-13 06:00:00',
                            '2019-09-13 06:05:00',
                            '2019-09-13 06:10:00',
                            '2019-09-13 06:15:00',
                            '2019-09-13 06:20:00']},
                  dtype='datetime64[ns]')
df = df.set_index('Date')
_count_of_days = df.resample('D').first().shape[0]
print(_count_of_days)
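Note that resample('D') creates a row for every calendar day between the first and last timestamp, so days with no data in between are counted as well, while dt.date.nunique() counts only days that actually occur. A small sketch of the difference, assuming a one-day gap:
import pandas as pd
df = pd.DataFrame({'Date': pd.to_datetime(['2019-09-13 06:00:00',
                                           '2019-09-15 06:00:00'])})
print(df.set_index('Date').resample('D').first().shape[0]) #3, includes the empty 14th
print(df['Date'].dt.date.nunique()) #2, only days with data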

Trying to convert Pandas column with only [HH:MM], but getting [YYYY-MM-DD HH:MM:SS] back

I have a pandas column with times in [HH:MM] format, as shown below.
I want to change the type to a time; with a bit of googling and looking around, to_datetime seemed to be what I should use.
0 NaN
1 06:56
2 NaN
3 NaN
4 NaN
Name: Time, dtype: object
I hacked together this piece of code to do it:
df['Time'] = pd.to_datetime(df['Time'], format= '%H:%M', errors='coerce')
But now I get this returned:
0 NaT
1 1900-01-01 06:56:00
2 NaT
3 NaT
4 NaT
Name: Time, dtype: datetime64[ns]
I don't need the date; the only thing I need is the HH:MM. I tried playing around with a few of the parameters, hoping I could figure it out, but to no avail. Can anyone point me in the right direction?
If you need further processing later, it is better to convert to timedeltas instead of times, using to_timedelta after appending seconds:
df['Time'] = pd.to_timedelta(df['Time'].add(':00'), errors='coerce')
print (df)
Time
0 NaT
1 06:56:00
2 NaT
3 NaT
4 NaT
But it is possible: add Series.dt.time to get python time objects:
df['Time'] = pd.to_datetime(df['Time'], format= '%H:%M', errors='coerce').dt.time
print (df)
Time
0 NaT
1 06:56:00
2 NaT
3 NaT
4 NaT
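If all that is needed is an HH:MM string for display (not a real time type), a minimal sketch; note that dt.strftime turns NaT into NaN:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M', errors='coerce').dt.strftime('%H:%M')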

Convert custom object to standard datetime object in Pandas

I have a bit of an odd Series full of dates and times which I want to convert to datetime so that I can do some manipulation:
allSubs['Subscribed']
0 12th December, 08:08
1 11th December, 14:57
2 10th December, 21:40
3 7th December, 21:39
4 5th December, 14:51
5 30th November, 15:36
When I call pd.to_datetime(allSubs['Subscribed']) on it, I get the error 'Out of bounds nanosecond timestamp: 1-12-12 08:08:00'. I tried using errors='coerce', but that just returns a series of NaT. I want to convert the series into a pandas datetime object with format YYYY-MM-DD.
I've looked into using datetime.strptime but couldn't find an efficient way to run this against a series.
Any help much appreciated!
Use:
from dateutil import parser
allSubs['Subscribed'] = allSubs['Subscribed'].apply(parser.parse)
print (allSubs)
Subscribed
0 2018-12-12 08:08:00
1 2018-12-11 14:57:00
2 2018-12-10 21:40:00
3 2018-12-07 21:39:00
4 2018-12-05 14:51:00
5 2018-11-30 15:36:00
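Note that dateutil fills missing fields such as the year from the current date, which is how 2018 appears above. To avoid depending on when the code runs, you can pin the default explicitly (the year 2018 here is an assumption):
from datetime import datetime
from dateutil import parser

allSubs['Subscribed'] = allSubs['Subscribed'].apply(
    lambda s: parser.parse(s, default=datetime(2018, 1, 1)))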
Or use str.replace with a regex; it is also necessary to specify a year, then use to_datetime with a custom format (see http://strftime.org/):
s = allSubs['Subscribed'].str.replace(r'(\d)(st|nd|rd|th)', r'\1 2018', regex=True)
allSubs['Subscribed'] = pd.to_datetime(s, format='%d %Y %B, %H:%M')
print (allSubs)
Subscribed
0 2018-12-12 08:08:00
1 2018-12-11 14:57:00
2 2018-12-10 21:40:00
3 2018-12-07 21:39:00
4 2018-12-05 14:51:00
5 2018-11-30 15:36:00

How to sum all amounts by date in pandas dataframe?

I have a dataframe with fields last_payout and amount. I need to sum amount for each month and plot the output.
df[['last_payout','amount']].dtypes
last_payout datetime64[ns]
amount float64
dtype: object
df[['last_payout','amount']].head
<bound method NDFrame.head of last_payout amount
0 2017-02-14 11:00:06 23401.0
1 2017-02-14 11:00:06 1444.0
2 2017-02-14 11:00:06 0.0
3 2017-02-14 11:00:06 0.0
4 2017-02-14 11:00:06 290083.0
I used the code from jezrael's answer to plot the number of transactions per month.
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.to_period('M')
.value_counts()
.sort_index()
.plot(kind="bar")
)
[Bar plot: number of transactions per month]
How do I sum amount for each month and plot the output? How should I extend the code above to do this?
I tried to implement .sum but didn't succeed.
PeriodIndex solution:
groupby the monthly period produced by to_period and aggregate with sum:
df['amount'].groupby(df['last_payout'].dt.to_period('M')).sum().plot(kind='bar')
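With the sample data from the Setup section below, the underlying sums (before .plot) are:
last_payout
2017-02     23401.0
2017-03      1444.0
2017-04    290083.0
Freq: M, Name: amount, dtype: float64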
DatetimeIndex solutions:
Use resample by month end ('M') or month start ('MS') with the aggregate function sum:
s = df.resample('M', on='last_payout')['amount'].sum()
#alternative
#s = df.groupby(pd.Grouper(freq='M', key='last_payout'))['amount'].sum()
print (s)
last_payout
2017-02-28 23401.0
2017-03-31 1444.0
2017-04-30 290083.0
Freq: M, Name: amount, dtype: float64
Or:
s = df.resample('MS', on='last_payout')['amount'].sum()
#s = df.groupby(pd.Grouper(freq='MS', key='last_payout'))['amount'].sum()
print (s)
last_payout
2017-02-01 23401.0
2017-03-01 1444.0
2017-04-01 290083.0
Freq: MS, Name: amount, dtype: float64
Then it is necessary to format the x labels:
ax = s.plot(kind='bar')
ax.set_xticklabels(s.index.strftime('%Y-%m'))
Setup:
import pandas as pd
from io import StringIO
temp=u"""last_payout,amount
2017-02-14 11:00:06,23401.0
2017-03-14 11:00:06,1444.0
2017-03-14 11:00:06,0.0
2017-04-14 11:00:06,0.0
2017-04-14 11:00:06,290083.0"""
#pd.compat.StringIO was removed in newer pandas; use io.StringIO instead
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), parse_dates=[0])
print (df)
last_payout amount
0 2017-02-14 11:00:06 23401.0
1 2017-03-14 11:00:06 1444.0
2 2017-03-14 11:00:06 0.0
3 2017-04-14 11:00:06 0.0
4 2017-04-14 11:00:06 290083.0
You can group by month-start ('MS') using resample:
df.set_index('last_payout').resample('MS').sum().plot(kind='bar')
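If the frame contains other numeric columns, it may be safer to select amount explicitly so only that column is summed:
df.set_index('last_payout').resample('MS')['amount'].sum().plot(kind='bar')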

Grouping by weekly breakdown of datetime64 in python

I have a pandas data frame with a column that represents dates as:
Name: ts_placed, Length: 13631, dtype: datetime64[ns]
It looks like this:
0 2014-10-18 16:53:00
1 2014-10-27 11:57:00
2 2014-10-27 11:57:00
3 2014-10-08 16:35:00
4 2014-10-24 16:36:00
5 2014-11-06 15:34:00
6 2014-11-11 10:30:00
....
I know how to group it in general using the function:
grouped = data.groupby('ts_placed')
What I want to do is to use the same function but to group the rows by week.
Pass
data['ts_placed'].dt.isocalendar().week
as the argument to groupby (in older pandas this was pd.DatetimeIndex(data['ts_placed']).week). This is the ordinal week in the year; see the DatetimeIndex documentation for other definitions of week. Note that grouping by week number alone merges the same week across different years.
You can also use a time-based Grouper (pd.TimeGrouper in older pandas, now replaced by pd.Grouper):
data.set_index('ts_placed').groupby(pd.Grouper(freq='W')).size()
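A complete runnable sketch, assuming the timestamps live in a 'ts_placed' column as in the question:
import pandas as pd

data = pd.DataFrame({'ts_placed': pd.to_datetime([
    '2014-10-18 16:53', '2014-10-27 11:57', '2014-10-27 11:57',
    '2014-10-08 16:35', '2014-10-24 16:36', '2014-11-06 15:34',
    '2014-11-11 10:30'])})

#one row per week; by default weekly bins end on Sunday
weekly = data.set_index('ts_placed').groupby(pd.Grouper(freq='W')).size()
print(weekly)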
