How often does a problem occur per day and location? - python

I have dataframe like this:
Date                 | Location_ID | Problem_ID
---------------------+-------------+-----------
2013-01-02 10:00:00  |           1 |         43
2012-08-09 23:03:01  |           5 |          2
...
How can I count how often a Problem occurs per day and per Location?

Use groupby, either converting the Date column to dates or using Grouper, and aggregate with size:
print (df)
                 Date  Location_ID  Problem_ID
0 2013-01-02 10:00:00            1          43
1 2012-08-09 23:03:01            5           2

# if necessary, convert the column to datetimes
df['Date'] = pd.to_datetime(df['Date'])

df1 = df.groupby([df['Date'].dt.date, 'Location_ID']).size().reset_index(name='count')
print (df1)
         Date  Location_ID  count
0  2012-08-09            5      1
1  2013-01-02            1      1
Or:
df1 = (df.groupby([pd.Grouper(key='Date', freq='D'), 'Location_ID'])
         .size()
         .reset_index(name='count'))
If first column is index:
print (df)
                     Location_ID  Problem_ID
Date
2013-01-02 10:00:00            1          43
2012-08-09 23:03:01            5           2
df.index = pd.to_datetime(df.index)
df1 = (df.groupby([df.index.date, 'Location_ID'])
         .size()
         .reset_index(name='count')
         .rename(columns={'level_0':'Date'}))
print (df1)
         Date  Location_ID  count
0  2012-08-09            5      1
1  2013-01-02            1      1
Or:
df1 = (df.groupby([pd.Grouper(level='Date', freq='D'), 'Location_ID'])
         .size()
         .reset_index(name='count'))
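If the count should additionally be broken down per individual problem (the question asks how often a Problem occurs), one sketch is to add Problem_ID to the grouping keys:
df1 = (df.groupby([df['Date'].dt.date, 'Location_ID', 'Problem_ID'])
         .size()
         .reset_index(name='count'))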

Related

Date interval average Python pandas

This is my dataframe:
ID number   Date purchase
1           2022-05-01
1           2021-03-03
1           2020-01-03
2           2019-01-03
2           2018-01-03
I want to get a horizontal dataframe with all the dates in separate columns per ID number.
So like this:
ID number   Date 1       Date 2       Date 3
1           2022-05-01   2021-03-03   2020-01-03
2           2019-01-03   2018-01-03
After I did this I want to calculate the difference between these dates.
First step is GroupBy.cumcount with DataFrame.pivot:
df['Date purchase'] = pd.to_datetime(df['Date purchase'])
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
         .assign(g=lambda x: x.groupby('ID number').cumcount())
         .pivot(index='ID number', columns='g', values='Date purchase')
         .rename(columns=lambda x: f'Date {x + 1}'))
print (df1)
g              Date 1     Date 2     Date 3
ID number
1          2022-05-01 2021-03-03 2020-01-03
2          2019-01-03 2018-01-03        NaT
Then for differences between columns use DataFrame.diff:
df2 = df1.diff(-1, axis=1)
print (df2)
g            Date 1   Date 2 Date 3
ID number
1          424 days 425 days    NaT
2          365 days      NaT    NaT
If averages are needed:
df3 = (df1.apply(pd.Series.mean, axis=1)
          .reset_index(name='Avg Dates')
          .rename_axis(None, axis=1))
print (df3)
   ID number            Avg Dates
0          1  2021-03-02 16:00:00
1          2  2018-07-04 12:00:00
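If the goal is the average gap between purchases per ID rather than the average date (an assumption, since the question first asks for the differences), the same row-wise mean applied to df2 gives it:
avg_gap = df2.apply(pd.Series.mean, axis=1).reset_index(name='Avg Gap')
# ID 1 -> 424 days 12:00:00 (mean of 424 and 425 days), ID 2 -> 365 days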
Could you do something like this?
def format_dataframe(df):
    """
    Formats the dataframe to the following:
    | ID number | Date 1     | Date 2     | Date 3     |
    | --------- | ---------- | ---------- | ---------- |
    | 1         | 2022-05-01 | 2021-03-03 | 2020-01-03 |
    | 2         | 2019-01-03 | 2018-01-03 |            |
    """
    df = df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
    df = df.assign(g=df.groupby('ID number').cumcount())
    df = df.pivot(index='ID number', columns='g', values='Date purchase')
    df.columns = [f'Date {c + 1}' for c in df.columns]
    return df.reset_index()
initial situation:
d = {'IdNumber': [1, 1, 1, 2, 2],
     'Date': ['2022-05-01', '2021-03-03', '2020-01-03', '2019-01-03', '2018-01-03']}
df = pd.DataFrame(data=d)
date conversion:
df['Date'] = pd.to_datetime(df['Date'])
creating new column:
df1 = df.assign(Col=lambda x: x.groupby('IdNumber').cumcount())
pivoting:
df1 = df1.pivot(index='IdNumber', columns='Col', values='Date')
reset index:
df1 = df1.reset_index(level=0)
rename columns:
df1 = df1.rename(columns=lambda c: c if c == 'IdNumber' else f'Date{c + 1}')
final result:
Col  IdNumber      Date1      Date2      Date3
0           1 2022-05-01 2021-03-03 2020-01-03
1           2 2019-01-03 2018-01-03        NaT
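For reference, the same steps can be chained into a single equivalent expression:
df1 = (df.assign(Col=lambda x: x.groupby('IdNumber').cumcount())
         .pivot(index='IdNumber', columns='Col', values='Date')
         .rename(columns=lambda c: f'Date{c + 1}')
         .reset_index())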

Combine consecutive rows of unsorted dates (one day before/after, or on the same day) into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
  Id      Start        End  Feature1  Feature2
0  A 2020-01-01 2020-01-15         1         1
1  A 2020-01-16 2020-01-30         1         1
2  A 2020-01-31 2020-02-15         0         1
3  A 2020-07-01 2020-07-15         0         1
4  B 2020-01-31 2020-02-15         0         0
5  B 2020-02-16        NaT         0         0
And the expected result is:
  Id      Start        End  Feature1  Feature2
0  A 2020-01-01 2020-01-30         1         1
1  A 2020-01-31 2020-02-15         0         1
2  A 2020-07-01 2020-07-15         0         1
3  B 2020-01-31        NaT         0         0
I have been trying other posts answers but they don't really match with my use case.
Thanks in advance!
You can approach this by:
Get the day diff between consecutive entries within the same group by subtracting the previous End from the current Start using GroupBy.shift().
Set a group number group_no such that a new group number is issued whenever the day diff from the previous entry within the group is greater than 1.
Then group by Id and group_no and aggregate the Start and End dates of each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within each group we use x.iloc[-1] instead of 'last' (which would skip NaT).
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])

day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first'}))
Result:
print(df_out)
  Id      Start        End  Feature1  Feature2
0  A 2020-01-01 2020-01-30         1         1
1  A 2020-01-31 2020-02-15         0         1
2  A 2020-07-01 2020-07-15         0         1
3  B 2020-01-31        NaT         0         0
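A variant of the final aggregation using named aggregation (a sketch, assuming pandas 1.1+ for named aggregation and dropna=False), which avoids listing 'Id' inside the agg dict:
df_out = (df.groupby(['Id', group_no.rename('group_no')], dropna=False)
            .agg(Start=('Start', 'first'),
                 End=('End', lambda x: x.iloc[-1]),
                 Feature1=('Feature1', 'first'),
                 Feature2=('Feature2', 'first'))
            .reset_index()
            .drop(columns='group_no'))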
Extract months from both date columns:
df['sMonth'] = pd.to_datetime(df['Start']).dt.month
df['eMonth'] = pd.to_datetime(df['End']).dt.month
Now group the dataframe by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
(df.groupby(['Id','Feature1','Feature2','sMonth','eMonth'])
   .agg({'Start':'min','End':'max'})
   .reset_index()
   .drop(['sMonth','eMonth'], axis=1))
Result:
  Id  Feature1  Feature2      Start        End
0  A         0         1 2020-01-31 2020-02-15
1  A         0         1 2020-07-01 2020-07-15
2  A         1         1 2020-01-01 2020-01-30
3  B         0         0 2020-01-31 2020-02-15

python pandas - group by date and count

I have the dataframe below. Dates are in DD/MM/YYYY format.
Date               id
1/5/2017 2:00 PM   100
1/5/2017 3:00 PM   101
2/5/2017 10:00 AM  102
3/5/2017 09:00 AM  103
3/5/2017 10:00 AM  104
4/5/2017 09:00 AM  105
I need to group by date (ignoring the time) and count the number of ids per day. The output dataframe should be as below:
DATE      Count
1/5/2017  2      -> counts ids 100, 101
2/5/2017  1
3/5/2017  2
4/5/2017  1
I need an efficient way to achieve the above.
Use:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df['Date'].dt.date.value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
df1 = df.groupby(df['Date'].dt.date).size().reset_index(name='Count')
print (df1)
         DATE  Count
0  2017-05-01      2
1  2017-05-02      1
2  2017-05-03      2
3  2017-05-04      1
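If the grouped column should stay a datetime (timestamps at midnight) rather than python date objects, dt.normalize is a variant of the same idea:
df1 = df.groupby(df['Date'].dt.normalize()).size().reset_index(name='Count')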
If you need the dates in the original string format, work on the raw string column (i.e. before the to_datetime conversion above):
df1 = df['Date'].str.split().str[0].value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Or:
new = df['Date'].str.split().str[0]
df1 = df.groupby(new).size().reset_index(name='Count')
print (df1)
       Date  Count
0  1/5/2017      2
1  2/5/2017      1
2  3/5/2017      2
3  4/5/2017      1

Append dataframe in for loop

I have a pandas dataframe with three columns: id, start_time, end_time, and I would like to transform it into a dataframe with two columns: id, time.
e.g. from [001, 1, 3], [002, 3, 4] to [001, 1], [001, 2], [001, 3], [002, 3], [002, 4]
Currently I am using a for loop and appending to the dataframe in each iteration, but it is very slow. Is there another method I can use to save time?
If start_time and end_time are timedeltas use:
df = pd.DataFrame([['001', 1, 3], ['002', 3, 4]],
                  columns=['id', 'start_time', 'end_time'])
print (df)
    id  start_time  end_time
0  001           1         3
1  002           3         4
#stack columns
df1 = pd.melt(df, id_vars='id', value_name='time').drop('variable', axis=1)
#convert int to timedelta
df1['time'] = pd.to_timedelta(df1.time, unit='s')
df1.set_index('time', inplace=True)
print (df1)
           id
time
00:00:01  001
00:00:03  002
00:00:03  001
00:00:04  002
#groupby by id and resample by one second
print (df1.groupby('id')
          .resample('1S')
          .ffill()
          .reset_index(drop=True, level=0)
          .reset_index())
       time   id
0  00:00:01  001
1  00:00:02  001
2  00:00:03  001
3  00:00:03  002
4  00:00:04  002
If start_time and end_time are datetimes use:
df = pd.DataFrame([['001', '2016-01-01', '2016-01-03'],
                   ['002', '2016-01-03', '2016-01-04']],
                  columns=['id', 'start_time', 'end_time'])
print (df)
    id start_time   end_time
0  001 2016-01-01 2016-01-03
1  002 2016-01-03 2016-01-04
df1 = pd.melt(df, id_vars='id', value_name='time').drop('variable', axis=1)
#convert to datetime
df1['time'] = pd.to_datetime(df1.time)
df1.set_index('time', inplace=True)
print (df1)
             id
time
2016-01-01  001
2016-01-03  002
2016-01-03  001
2016-01-04  002
#groupby by id and resample by one day
print (df1.groupby('id')
          .resample('1D')
          .ffill()
          .reset_index(drop=True, level=0)
          .reset_index())
        time   id
0 2016-01-01  001
1 2016-01-02  001
2 2016-01-03  001
3 2016-01-03  002
4 2016-01-04  002
Here is my take on your question:
df.set_index('id', inplace=True)
reshaped = (df.apply(lambda x: pd.Series(range(x['start time'], x['end time'] + 1)), axis=1)
              .stack()
              .reset_index()
              .drop('level_1', axis=1))
reshaped.columns = ['id', 'time']
reshaped
Test
Input:
import pandas as pd
from io import StringIO

data = StringIO("""id,start time,end time
001, 1, 3
002, 3, 4""")
df = pd.read_csv(data, dtype={'id': 'object'})
df.set_index('id', inplace=True)
print("In\n", df)
reshaped = (df.apply(lambda x: pd.Series(range(x['start time'], x['end time'] + 1)), axis=1)
              .stack()
              .reset_index()
              .drop('level_1', axis=1))
reshaped.columns = ['id', 'time']
print("Out\n", reshaped)
Output:
In
      start time  end time
id
001            1         3
002            3         4
Out
     id  time
0  001     1
1  001     2
2  001     3
3  002     3
4  002     4
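For newer pandas, the reshape can also be done without apply; a sketch assuming pandas 1.1+ (for explode with ignore_index), building the per-row ranges with a list comprehension and exploding them:
df = pd.DataFrame([['001', 1, 3], ['002', 3, 4]],
                  columns=['id', 'start_time', 'end_time'])
out = (df.assign(time=[list(range(s, e + 1)) for s, e in zip(df['start_time'], df['end_time'])])
         [['id', 'time']]
         .explode('time', ignore_index=True))
# same five rows: (001, 1), (001, 2), (001, 3), (002, 3), (002, 4)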

Which is the fastest way to extract day, month and year from a given date?

I read a csv file containing 150,000 lines into a pandas dataframe. This dataframe has a field, Date, with dates in yyyy-mm-dd format. I want to extract the month, day and year from it and copy them into the dataframe's columns Month, Day and Year respectively. For a few hundred records the two methods below work OK, but for 150,000 records both take a ridiculously long time to execute. Is there a faster way to do this for 100,000+ records?
First method:
df = pandas.read_csv(filename)
for i in xrange(len(df)):
    df.loc[i,'Day'] = int(df.loc[i,'Date'].split('-')[2])
Second method:
df = pandas.read_csv(filename)
for i in xrange(len(df)):
    df.loc[i,'Day'] = datetime.strptime(df.loc[i,'Date'], '%Y-%m-%d').day
Thank you.
In 0.15.0 you will be able to use the new .dt accessor to do this with nice syntax:
In [36]: df = DataFrame(date_range('20000101',periods=150000,freq='H'),columns=['Date'])

In [37]: df.head(5)
Out[37]:
                 Date
0 2000-01-01 00:00:00
1 2000-01-01 01:00:00
2 2000-01-01 02:00:00
3 2000-01-01 03:00:00
4 2000-01-01 04:00:00

[5 rows x 1 columns]

In [38]: def f(df):
   ....:     df = df.copy()
   ....:     df['Year'] = DatetimeIndex(df['Date']).year
   ....:     df['Month'] = DatetimeIndex(df['Date']).month
   ....:     df['Day'] = DatetimeIndex(df['Date']).day
   ....:     return df
   ....:

In [39]: %timeit f(df)
10 loops, best of 3: 22 ms per loop

In [40]: f(df).head()
Out[40]:
                 Date  Year  Month  Day
0 2000-01-01 00:00:00  2000      1    1
1 2000-01-01 01:00:00  2000      1    1
2 2000-01-01 02:00:00  2000      1    1
3 2000-01-01 03:00:00  2000      1    1
4 2000-01-01 04:00:00  2000      1    1

[5 rows x 4 columns]
From 0.15.0 on (released at the end of September 2014), the following is now possible with the new .dt accessor:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
I use the code below, which works very well for me:
df['Year'] = [d.split('-')[0] for d in df.Date]
df['Month'] = [d.split('-')[1] for d in df.Date]
df['Day'] = [d.split('-')[2] for d in df.Date]
df.head(5)
Note this assumes Date still holds the raw yyyy-mm-dd strings, and that the extracted values are strings, not integers.
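A vectorized version of the same split idea (a sketch, again assuming Date holds yyyy-mm-dd strings):
df[['Year', 'Month', 'Day']] = df['Date'].str.split('-', expand=True).astype(int)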
This is the cleanest answer I've found: build all the columns at once with a dict comprehension over the .dt accessor.
In [29]: start, end = '2011-01-01', '2011-12-31'   # example range

In [30]: df = pd.DataFrame({'data': pd.date_range(start, end)})

In [31]: df.head()
Out[31]:
        data
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05

In [32]: nomtimes = ["year", "hour", "month", "dayofweek"]

In [33]: df = df.assign(**{t: getattr(df.data.dt, t) for t in nomtimes})

In [34]: df.head()
Out[34]:
        data  dayofweek  hour  month  year
0 2011-01-01          5     0      1  2011
1 2011-01-02          6     0      1  2011
2 2011-01-03          0     0      1  2011
3 2011-01-04          1     0      1  2011
4 2011-01-05          2     0      1  2011
