I want to reduce my data. My initial dataframe looks as follows:

index  time [hh:mm:ss]          value1  value2
0      0 days 00:00:00.000000   3       4
1      0 days 00:00:04.000000   5       2
2      0 days 00:02:02.002300   7       9
3      0 days 00:02:03.000000   9       7
4      0 days 03:02:03.000000   4       3
Now I want to reduce the data so that only the first row of every new minute (and, by extension, every new hour and day) is kept; all remaining rows within that minute should be dropped.
So the resulting table looks as follows:
index  time                     value1  value2
0      0 days 00:00:00.000000   3       4
2      0 days 00:02:02.002300   7       9
4      0 days 03:02:03.000000   4       3
Any ideas how to approach this?
The values are timedeltas, so it is possible to create a TimedeltaIndex and use DataFrame.resample with a 1-minute frequency and Resampler.first; resampling adds a row for every minute, so the all-NaN rows are removed afterwards:
df.index = pd.to_timedelta(df['time [hh:mm:ss]'])
df = df.resample('1Min').first().dropna(how='all').reset_index(drop=True)
print (df)
time [hh:mm:ss] value1 value2
0 0 days 00:00:00.000000 3.0 4.0
1 0 days 00:02:02.002300 7.0 9.0
2 0 days 03:02:03.000000 4.0 3.0
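The snippet above assumes df already exists with the time column stored as strings; for reference, a minimal sketch of building the sample frame for testing could look like this:
df = pd.DataFrame({
    'time [hh:mm:ss]': ['0 days 00:00:00.000000', '0 days 00:00:04.000000',
                        '0 days 00:02:02.002300', '0 days 00:02:03.000000',
                        '0 days 03:02:03.000000'],
    'value1': [3, 5, 7, 9, 4],
    'value2': [4, 2, 9, 7, 3],
})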
You could extract the D:HH:MM part using apply and a couple of string splits, and then drop the duplicates, keeping the first value.
# build a "D:HH:MM" key from the string column
dms = df['time [hh:mm:ss]'].apply(
    lambda x: ':'.join([x.split(' days ')[0], *x.split('days ')[1].split(':')[:2]])
)
# keep only the first row of each distinct day/hour/minute
df.iloc[dms.drop_duplicates().index]
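A shorter variant of the same idea, sketched under the assumption that the rows are sorted by time, converts the column to timedeltas, floors each value to the minute and keeps the first row per minute:
# hedged alternative: floor to the minute and keep the first occurrence
td = pd.to_timedelta(df['time [hh:mm:ss]'])
df[~td.dt.floor('min').duplicated()]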
from io import StringIO

d = '''index,time,value1,value2
0,0 days 00:00:00.000000,3,4
1,0 days 00:00:04.000000,5,2
2,0 days 00:02:02.002300,7,9
3,0 days 00:02:03.000000,9,7
4,0 days 03:02:03.000000,4,3'''
df = pd.read_csv(StringIO(d), parse_dates=True)
df
# strip the '0 days ' prefix and parse the remaining time of day
df['time1'] = pd.to_datetime(df['time'].str.slice(7))
df.set_index('time1',inplace=True)
df
# keep the first row of each (hour, minute) group
df.groupby([df.index.hour, df.index.minute]).head(1).sort_index().reset_index(drop=True)
I have a pandas dataframe that contains dates in a column Date. I need to add another column Days that contains the difference in days from the previous row, i.e. the value in row i should be the difference between the dates in rows i and i-1. The first difference should be 0.
Date Days
08-01-1997 0
09-01-1997 1
10-01-1997 1
13-01-1997 3
14-01-1997 1
15-01-1997 1
01-03-1997 45
03-03-1997 2
04-03-1997 1
05-03-1997 1
13-06-1997 100
I tried this, but it was not useful.
First convert the Date column to a pandas datetime, then calculate the difference, which is a timedelta object; from there, take the days via Series.dt and assign 0 to the first value:
>>> df['Date']=pd.to_datetime(df['Date'], dayfirst=True)
>>> df['Days']=(df['Date']-df['Date'].shift()).dt.days.fillna(0).astype(int)
OUTPUT
df
Date Days
0 1997-01-08 0
1 1997-01-09 1
2 1997-01-10 1
3 1997-01-13 3
4 1997-01-14 1
5 1997-01-15 1
6 1997-03-01 45
7 1997-03-03 2
8 1997-03-04 1
9 1997-03-05 1
10 1997-06-13 100
You can use diff as well:
df['date_up'] = pd.to_datetime(df['Date'],dayfirst=True)
df['date_diff'] = df['date_up'].diff()
df['date_diff_num_days'] = df['date_diff'].dt.days.fillna(0).astype(int)
df.head()
Date Days date_up date_diff date_diff_num_days
0 08-01-1997 0 1997-01-08 NaT 0
1 09-01-1997 1 1997-01-09 1 days 1
2 10-01-1997 1 1997-01-10 1 days 1
3 13-01-1997 3 1997-01-13 3 days 3
4 14-01-1997 1 1997-01-14 1 days 1
I need to aggregate data between a constant date, like the first day of the year, and all the other dates through the year. There are two variants of this problem:
easier - sum:
created_at value
01-01-2012 5
02-01-2012 6
05-01-2012 1
05-01-2012 1
01-02-2012 3
02-02-2012 2
05-02-2012 1
which should output:
Date Month to date sum Year to date sum
01-01-2012 5 5
02-01-2012 11 11
05-01-2012 13 13
01-02-2012 3 14
02-02-2012 5 15
05-02-2012 6 16
and harder - count unique:
created_at value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
05-02-2012 d
which should output:
Date Month to date unique Year to date unique
01-01-2012 1 1
02-01-2012 2 2
05-01-2012 3 3
01-02-2012 1 3
02-02-2012 1 3
05-02-2012 2 4
The data is, of course, in a Pandas data frame. The obvious but very clumsy way is to create a for loop between the starting date and the moving one. The problem looks like a popular one. Is there some reasonable pandas built-in way for this type of computation? Regarding counting unique values, I also want to avoid stacking lists, as I have a large number of rows and unique values.
I was checking out Pandas window functions, but it doesn't look like a solution.
Try with groupby:
Cumulative sum:
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")
df["Month to date sum"] = df.groupby(df["created_at"].dt.month)["value"].transform('cumsum')
df["Year to date sum"] = df.groupby(df["created_at"].dt.year)["value"].transform('cumsum')
>>> df
created_at value Month to date sum Year to date sum
0 2012-01-01 5 5 5
1 2012-01-02 6 11 11
2 2012-01-05 1 12 12
3 2012-02-01 3 3 15
4 2012-02-02 2 5 17
5 2012-02-05 1 6 18
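One caveat: grouping by .dt.month alone would merge the same month of different years if the data spans more than one year. A year-safe sketch (a hypothetical variant, not part of the original answer) groups on a monthly period instead:
# 'M' periods are year-aware, so January 2012 and January 2013 stay separate
month_key = df["created_at"].dt.to_period("M")
df["Month to date sum"] = df.groupby(month_key)["value"].transform("cumsum")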
Cumulative unique count:
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")
df2["Month to date unique"] = df2.groupby(df2["created_at"].dt.month)["value"].apply(lambda x: (~x.duplicated()).cumsum())
df2["Year to date unique"] = df2.groupby(df2["created_at"].dt.year)["value"].apply(lambda x: (~x.duplicated()).cumsum())
>>> df2
created_at value Month to date unique Year to date unique
0 2012-01-01 a 1 1
1 2012-01-02 b 2 2
2 2012-01-05 c 3 3
3 2012-02-01 a 1 3
4 2012-02-02 a 1 3
5 2012-02-05 d 2 4
In one data frame (called X) I have Patient_admitted_id, Date, and Hospital_ID of patients who tested positive for covid (shown below). I want to generate a separate data frame (called Y) with the calendar Dates, the total number of covid Cases, and the Cumulative cases.
I don't know how to generate the Cases column.
X data frame:
data = {'Patient_admitted_id': [214321,224323,3234234,23423],
'Date': ['2021-01-22', '2021-01-22','2021-01-22','2021-01-20'], # This is just an example I have created here, the real X data frame contains proper date values generated with Datatime
'Hospital_ID': ['1', '2','3','2'],
}
X = pd.DataFrame(data, columns=['Patient_admitted_id','Date', 'Hospital_ID' ])
X
Patient_admitted_id Date Hospital_ID
0 214321 2021-01-22 1
1 224323 2021-01-22 2
2 3234234 2021-01-22 3
3 23423 2021-01-20 2
...
Desired Y data frame:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
...
Use DataFrame.resample by day with counts from Resampler.size, then Series.cumsum for the cumulative counts:
X['Date']= pd.to_datetime(X['Date'])
df = X.resample('D', on='Date').size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
print (df)
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
You can use groupby on the Date column and call size to get the count per date, then simply call cumsum on Cases to get the desired output:
out = X.groupby('Date').size().to_frame('Cases').reset_index()
out['Cumulative'] = out['Cases'].cumsum()
The out variable holds the desired dataframe.
OUTPUT:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-22 3 4
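Note that groupby only produces rows for dates that actually occur, which is why 2021-01-21 is missing here. If every calendar day is needed (as in the resample answer above), one sketch, assuming Date has already been converted with pd.to_datetime, is to reindex against a full date range:
# fill the gaps with zero-case days and recompute the cumulative sum
full_days = pd.date_range(out['Date'].min(), out['Date'].max(), freq='D')
out = (out.set_index('Date')
          .reindex(full_days, fill_value=0)
          .rename_axis('Date')
          .reset_index())
out['Cumulative'] = out['Cases'].cumsum()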
Just adding a solution with pd.Grouper:
X['Date']= pd.to_datetime(X['Date'])
df = X.groupby(pd.Grouper(key='Date', freq='D')).size().reset_index(name='Cases')
df['Cumulative'] = df.Cases.cumsum()
df
Output
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
I have a dataframe made up of daily data across a number of columns:
A B C D
01/01/2020 12 3 2 1
02/01/2020 8 14 5 1
03/01/2020 45 4 1 3
.
.
.
.
31/12/2021 5 1 5 3
The data is generated automatically, but I would like to be able to overwrite data by month or by date.
I understand something like this could reset a single value, but is there any way to do it in bulk by month or between two dates?
df.set_value('C', 'x', 10)
Any help much appreciated!
Create a DatetimeIndex first and then set values with DataFrame.loc; partial string indexing also works here for setting the values of a whole month:
df.index = pd.to_datetime(df.index, dayfirst=True)
df.loc['2020-01-02','C'] = 100
df.loc['2020-01','B'] = 500
df.loc['2020-01-01':'2020-01-02','A'] = 0
#select multiple columns by list
df.loc['2020-01-03':'2021-12-31', ['C','D']] = 1000
print (df)
A B C D
2020-01-01 0 500 2 1
2020-01-02 0 500 100 1
2020-01-03 45 500 1000 1000
2021-12-31 5 1 1000 1000
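If you prefer not to rely on partial string indexing, an equivalent sketch uses a boolean mask over the same DatetimeIndex:
# same assignment as the last .loc line above, expressed with an explicit mask
mask = (df.index >= '2020-01-03') & (df.index <= '2021-12-31')
df.loc[mask, ['C', 'D']] = 1000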
I have a Pandas Dataframe with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple Calls). A third column gives a day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID Day PersonID ContactsIN7Days
6 2 3 2
3 14 2 2
1 8 1 1
5 1 3 2
2 12 2 2
7 100 3 1
This works; however, it is going to be applied to a big data set. Would there be a way to make this more efficient, e.g. through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the order of your dataframe is modified as a result, and that your Day column is converted to a timedelta dtype:
Starting from your dataframe df:
CallID Day PersonID
0 6 2 3
1 3 14 2
2 1 8 1
3 5 1 3
4 2 12 2
5 7 100 3
Start by modifying Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then, use pd.merge_asof to merge your dataframe with the count of calls by each individual in a period of 7 days. To get this, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = pd.merge_asof(
    df.sort_values(['Day']),
    df.sort_values(['Day'])
      .groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
      .size()
      .to_frame('ContactsIN7Days')
      .reset_index(),
    left_on='Day', right_on='Day',
    left_by='PersonID', right_by='PersonID',
    direction='nearest')
Your resulting new_df will look like this:
CallID Day PersonID ContactsIN7Days
0 5 1 days 3 2
1 6 2 days 3 2
2 1 8 days 1 1
3 2 12 days 2 2
4 3 14 days 2 2
5 7 100 days 3 1
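Since the stated goal was the maximum number of calls per customer over a 7-day period, a possible final step (a sketch that reuses the binned counts computed above, so it reflects fixed 7-day bins rather than a rolling window) is:
# highest 7-day-bin call count per customer
max_calls = new_df.groupby('PersonID')['ContactsIN7Days'].max()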