python pandas - group by date and count

I have the dataframe below. Date is in DD/MM/YYYY format:
Date               id
1/5/2017 2:00 PM   100
1/5/2017 3:00 PM   101
2/5/2017 10:00 AM  102
3/5/2017 09:00 AM  103
3/5/2017 10:00 AM  104
4/5/2017 09:00 AM  105
I need to group by date (ignoring the time component) and count the number of ids per day. The output dataframe should look like this:
DATE      Count
1/5/2017      2   -> counts ids 100, 101
2/5/2017      1
3/5/2017      2
4/5/2017      1
What is an efficient way to achieve this?
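A minimal reproduction of this frame (assuming the Date column holds plain strings):
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/5/2017 2:00 PM', '1/5/2017 3:00 PM', '2/5/2017 10:00 AM',
             '3/5/2017 09:00 AM', '3/5/2017 10:00 AM', '4/5/2017 09:00 AM'],
    'id': [100, 101, 102, 103, 104, 105],
})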

Use:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df['Date'].dt.date.value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
df1 = df.groupby(df['Date'].dt.date).size().reset_index(name='Count')
print (df1)
DATE Count
0 2017-05-01 2
1 2017-05-02 1
2 2017-05-03 2
3 2017-05-04 1
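A variant I'll add here (not in the original answer): if you'd rather keep the grouping key as datetime64 instead of Python date objects, dt.normalize() zeroes out the time component:
# normalize() keeps dtype datetime64[ns], with the time set to 00:00:00
df1 = (df.groupby(df['Date'].dt.normalize())
         .size()
         .reset_index(name='Count'))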
If you need the same string format as the input:
df1 = df['Date'].str.split().str[0].value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
new = df['Date'].str.split().str[0]
df1 = df.groupby(new).size().reset_index(name='Count')
print (df1)
Date Count
0 1/5/2017 2
1 2/5/2017 1
2 3/5/2017 2
3 4/5/2017 1


Combine consecutive rows of unsorted dates based on one day before/after or on the same day into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have tried answers from other posts, but they don't really match my use case.
Thanks in advance!
You can approach this by:
1. Getting the day difference between consecutive entries within the same group, by subtracting the previous End (obtained with GroupBy.shift()) from the current Start.
2. Setting a group number group_no such that a new group number is issued when the day difference from the previous entry within the group is greater than 1.
3. Grouping by Id and group_no and aggregating the Start and End dates of each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False when grouping. Furthermore, to get the last entry of End within the group, we use x.iloc[-1] instead of 'last', since 'last' would skip the NaT.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])

# day gap between each row's Start and the previous row's End within
# the (Id, Feature1, Feature2) group
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days

# start a new group whenever the gap is missing or greater than 1 day
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
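To see why x.iloc[-1] is used instead of 'last', here is a minimal sketch I'm adding (not part of the original answer):
import pandas as pd

# GroupBy.last() returns the last *non-null* value per group, so a
# trailing NaT would be silently replaced by an earlier date
s = pd.Series([pd.Timestamp('2020-01-31'), pd.NaT])
key = ['B', 'B']
print(s.groupby(key).last())                     # B   2020-01-31
print(s.groupby(key).agg(lambda x: x.iloc[-1]))  # B          NaT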
Extract months from both date columns:
df['sMonth'] = df['Start'].apply(pd.to_datetime).dt.month
df['eMonth'] = df['End'].apply(pd.to_datetime).dt.month
Now group the dataframe by ['Id','Feature1','Feature2','sMonth','eMonth'] and aggregate:
(df.groupby(['Id','Feature1','Feature2','sMonth','eMonth'])
   .agg({'Start':'min','End':'max'})
   .reset_index()
   .drop(['sMonth','eMonth'], axis=1))
Result:
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15

How can I concat Year, Day, Hour and Minute columns of a DataFrame

I have a DataFrame with this kind of data:
Year  Day  Hour  Minute
2017  244     0       0
2017  244     0       1
2017  244     0       2
I want to create a new column in this DataFrame showing the date plus hour and minute, but I don't know how to convert the day-of-year into months and unify everything.
I tried something using pd.to_datetime, like the code below:
line['datetime'] = pd.to_datetime(line['Year'] + line['Day'] + line['Hour'] + line['Minute'], format= '%Y%m%d %H%M')
I would like to have something like this:
Year  Month  Day  Hour  Minute
2017      9    1     0       0
2017      9    1     0       1
2017      9    1     0       2
So in your case, do (%j in the format string parses the day of year):
df['date'] = pd.to_datetime(df.astype(str).agg(' '.join, axis=1), format='%Y %j %H %M')
Out[294]:
0 2017-09-01 00:00:00
1 2017-09-01 00:01:00
2 2017-09-01 00:02:00
dtype: datetime64[ns]
#df['month'] = df['date'].dt.month
#df['day'] = df['date'].dt.day
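For reference, a self-contained sketch of this approach (my addition), assuming the four columns hold integers:
import pandas as pd

df = pd.DataFrame({'Year': [2017, 2017, 2017],
                   'Day': [244, 244, 244],
                   'Hour': [0, 0, 0],
                   'Minute': [0, 1, 2]})

# join each row into a string like '2017 244 0 0'; %j turns the
# day-of-year into month/day (day 244 of 2017 -> September 1)
df['date'] = pd.to_datetime(df.astype(str).agg(' '.join, axis=1),
                            format='%Y %j %H %M')
df['Month'] = df['date'].dt.month
df['Day'] = df['date'].dt.day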
Try:
s = pd.to_datetime(df['Year'], format='%Y') \
    + pd.TimedeltaIndex(df['Day'] - 1, unit='D')
print(s)
# Output
0 2017-09-01
1 2017-09-01
2 2017-09-01
dtype: datetime64[ns]
Now you can insert your columns:
df.insert(1, 'Month', s.dt.month)
df['Day'] = s.dt.day
print(df)
# Output
Year Month Day Hour Minute
0 2017 9 1 0 0
1 2017 9 1 0 1
2 2017 9 1 0 2
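A side note (my addition, not from the answer): pd.to_timedelta is the more common spelling, and the Hour and Minute columns can be folded in the same way:
s = (pd.to_datetime(df['Year'], format='%Y')
     + pd.to_timedelta(df['Day'] - 1, unit='D')
     + pd.to_timedelta(df['Hour'], unit='h')
     + pd.to_timedelta(df['Minute'], unit='m'))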
df["Month"]=round(df["Day"]/30+.5).astype(int)
This establishes a new column and populates taht column by using the day column to calculate the month (total days / 30), rounding up by adding .5 and inserting it as an integer using astype

Python converting date and time as pandas index

I want to convert my datetime column to be the pandas dataframe index. This is my dataframe:
Date Observed Min Max Sum Count
0 09/15/2018 12:00:00 AM 2 0 2 10 5
1 09/15/2018 01:00:00 AM 1 0 2 25 20
2 09/15/2018 02:00:00 AM 1 0 1 21 21
3 09/15/2018 03:00:00 AM 1 0 2 23 22
4 09/15/2018 04:00:00 AM 1 0 1 21 21
And I want the Date to be the index for the dataframe.
I've looked for answers and have tried this code
dateparse = lambda dates: pd.datetime.strptime(dates, '%m/%d/%Y %I:%M:%S').strftime('%m/%d/%Y %I:%M:%S %p')
data = pd.read_csv('mandol.csv', sep=';', parse_dates=['Date'], index_col = 'Date', date_parser=dateparse)
data.head()
but the result is still an error -> ValueError: unconverted data remains: AM
How can I solve this?
Use pd.to_datetime() to convert the Date column and set_index() to set it as your dataframe index.
import pandas as pd
>>>df
Date Observed Min Max Sum Count
0 09/15/2018 12:00:00 AM 2 0 2 10 5
1 09/15/2018 01:00:00 AM 1 0 2 25 20
2 09/15/2018 02:00:00 AM 1 0 1 21 21
3 09/15/2018 03:00:00 AM 1 0 2 23 22
4 09/15/2018 04:00:00 AM 1 0 1 21 21
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
>>>df
Unnamed: 0 Observed Min Max Sum Count
Date
2018-09-15 00:00:00 0 2 0 2 10 5
2018-09-15 01:00:00 1 1 0 2 25 20
2018-09-15 02:00:00 2 1 0 1 21 21
2018-09-15 03:00:00 3 1 0 2 23 22
2018-09-15 04:00:00 4 1 0 1 21 21
We can set the index to be the Date column values converted with to_datetime (I'm using pop to get values of the Date column and remove it from the DataFrame at the same time):
df.index = pd.to_datetime(df.pop('Date'))
print(df)
Output:
Observed Min Max Sum Count
Date
2018-09-15 00:00:00 2 0 2 10 5
2018-09-15 01:00:00 1 0 2 25 20
2018-09-15 02:00:00 1 0 1 21 21
2018-09-15 03:00:00 1 0 2 23 22
2018-09-15 04:00:00 1 0 1 21 21
Have a look at the set_index() method.
If you use this code, it sets the second column (Date) as index and transforms it with the standard datetime parser provided by pandas.to_datetime:
ds = pd.read_csv('mandol.csv', sep=';', index_col=1, parse_dates=True)
parse_dates=True automatically transforms the index to a pandas Datetime object.
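For completeness: the original error comes from the format string missing %p, so the parser stops before 'AM'. A corrected parser along the lines of the question's code would look like this (a sketch; note that date_parser is deprecated in recent pandas in favour of the date_format argument):
import pandas as pd

# '%I' is the 12-hour clock, so '%p' is required to consume 'AM'/'PM'
dateparse = lambda d: pd.to_datetime(d, format='%m/%d/%Y %I:%M:%S %p')
data = pd.read_csv('mandol.csv', sep=';', parse_dates=['Date'],
                   index_col='Date', date_parser=dateparse)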

How often does a problem occur per day and location?

I have a dataframe like this:
Date Location_ID Problem_ID
---------------------+------------+----------
2013-01-02 10:00:00 | 1 | 43
2012-08-09 23:03:01 | 5 | 2
...
How can I count how often a Problem occurs per day and per Location?
Use groupby, either converting the Date column to dates or using Grouper, and aggregate with size:
print (df)
Date Location_ID Problem_ID
0 2013-01-02 10:00:00 1 43
1 2012-08-09 23:03:01 5 2
#if necessary convert column to datetimes
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby([df['Date'].dt.date, 'Location_ID']).size().reset_index(name='count')
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
Or:
df1 = (df.groupby([pd.Grouper(key='Date', freq='D'), 'Location_ID'])
         .size()
         .reset_index(name='count'))
If the first column is the index:
print (df)
Location_ID Problem_ID
Date
2013-01-02 10:00:00 1 43
2012-08-09 23:03:01 5 2
df.index = pd.to_datetime(df.index)
df1 = (df.groupby([df.index.date, 'Location_ID'])
         .size()
         .reset_index(name='count')
         .rename(columns={'level_0':'Date'}))
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
Or:
df1 = (df.groupby([pd.Grouper(level='Date', freq='D'), 'Location_ID'])
         .size()
         .reset_index(name='count'))
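If a day-by-location matrix is handier than the long format, a variant I'll add here (assuming Date is a regular column, as in the first example) is pd.crosstab:
# rows: calendar days, columns: Location_ID, cells: occurrence counts
counts = pd.crosstab(df['Date'].dt.date, df['Location_ID'])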

Group python pandas dataframe per weeks (starting on Monday)

I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W'), but it groups weeks starting on Sunday (see df_final below).
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
Forecast
Site Type Product Date
W1 G1 1234 2015-07-01 8
2015-07-30 2
2015-07-15 2
2015-07-02 4
G2 2345 2015-07-05 5
2015-07-07 1
2015-07-09 1
2015-07-11 3
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.TimeGrouper('W')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W; check anchored offsets:
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print (df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem as described below. First, I should state that the ex-accepted answer is incorrect. Here is why:
# let's create an example df of length 9, 2020-03-08 is a Sunday
s = pd.DataFrame({'dt': pd.date_range('2020-03-08', periods=9, freq='D'),
                  'counts': 0})
> s
                    dt  counts
0  2020-03-08 00:00:00       0
1  2020-03-09 00:00:00       0
2  2020-03-10 00:00:00       0
3  2020-03-11 00:00:00       0
4  2020-03-12 00:00:00       0
5  2020-03-13 00:00:00       0
6  2020-03-14 00:00:00       0
7  2020-03-15 00:00:00       0
8  2020-03-16 00:00:00       0
These nine days span three Monday-to-Sunday weeks: the weeks of March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt', freq='W-Mon')).count()
                     counts
dt
2020-03-09 00:00:00       2
2020-03-16 00:00:00       7
This is wrong because the OP wants to have "Monday as the first day of the week" (not as the last day of the week) in the resulting dataframe. Let's see what we get when we try with freq='W':
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
                     counts
dt
2020-03-08 00:00:00       1
2020-03-15 00:00:00       7
2020-03-22 00:00:00       1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled the 'dt' with the END of the week, rather than the start. So, to get what we want, we can move the index by 6 days like:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
or alternatively we can do:
s.groupby(pd.Grouper(key='dt',freq='W-Mon',label='left',closed='left')).count()
A third solution, arguably the most readable one, is converting dt to a period first, then grouping, and finally (if needed) converting back to a timestamp. A weekly period defaults to W-SUN, i.e. spans running Monday through Sunday, and to_timestamp() returns the start of each span, which is why the labels land on Mondays:
s.groupby(s.dt.dt.to_period('W'))['counts'].count().to_timestamp()
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
All of these solutions return what the OP asked for:
                     counts
dt
2020-03-02 00:00:00       1
2020-03-09 00:00:00       7
2020-03-16 00:00:00       1
Explanation: when freq is provided to pd.Grouper, both the closed and label kwargs default to 'right'. Setting freq to W (short for W-SUN) works because we want our week to end on Sunday (Sunday included; closed='right' handles this). Unfortunately, the pd.Grouper docstring does not show the default values, but you can see them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right
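For a quick sanity check that the index-shift variant and the label='left', closed='left' variant agree, a sketch I'm adding (reusing s from above):
# both variants should produce identical Monday-labelled counts
w1 = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w1.index -= pd.Timedelta(days=6)
w2 = s.groupby(pd.Grouper(key='dt', freq='W-Mon',
                          label='left', closed='left')).count()
assert w1['counts'].tolist() == w2['counts'].tolist()
assert list(w1.index) == list(w2.index)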
