My dataframe looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:00
3 2019-04-22 00:01:00
4 2019-04-22 00:01:00
5 2019-04-22 00:02:00
6 2019-04-22 00:02:00
7 2019-04-22 00:02:00
8 2019-04-22 00:02:00
9 2019-04-22 00:03:00
10 2019-04-22 00:03:00
11 2019-04-22 00:03:00
12 2019-04-22 00:03:00
As you can see, there are four rows for each minute. What I need is to add a 15-second increment to each successive row within a minute, so that it looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Any idea on how to proceed? I am not very comfortable with datetime objects, so I am a bit stuck on this one. Thank you in advance!
You can add timedeltas to the datetime column:
df['date'] += pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s')
print (df)
date
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Detail:
First create a counter Series with GroupBy.cumcount:
print (df.groupby('date').cumcount())
1 0
2 1
3 2
4 3
5 0
6 1
7 2
8 3
9 0
10 1
11 2
12 3
dtype: int64
Multiply by 15 and convert to timedeltas in seconds with to_timedelta:
print (pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s'))
1 00:00:00
2 00:00:15
3 00:00:30
4 00:00:45
5 00:00:00
6 00:00:15
7 00:00:30
8 00:00:45
9 00:00:00
10 00:00:15
11 00:00:30
12 00:00:45
dtype: timedelta64[ns]
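If the number of rows per minute is not always four, the same idea can be generalized by dividing the minute evenly by the group size. This is only a sketch under that assumption; the helper name spread_within_minute and the sample data are made up for illustration.

```python
import pandas as pd


def spread_within_minute(dates: pd.Series) -> pd.Series:
    """Spread duplicate timestamps evenly across one minute (hypothetical helper)."""
    grouped = dates.groupby(dates)
    position = grouped.cumcount()          # 0, 1, 2, ... within each timestamp
    size = grouped.transform("count")      # number of rows sharing that timestamp
    step_seconds = 60 / size               # 15 s for 4 rows, 20 s for 3 rows
    return dates + pd.to_timedelta(position * step_seconds, unit="s")


# Hypothetical sample: four rows in the first minute, three in the second.
dates = pd.Series(
    pd.to_datetime(["2019-04-22 00:01:00"] * 4 + ["2019-04-22 00:02:00"] * 3)
)
result = spread_within_minute(dates)
print(result)
```

With exactly four rows per minute this reduces to the cumcount() * 15 version above.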
Related
I have a csv below:
ID Date Time Flag
1 14/05/2018 00:01:00 NaN
1 14/05/2018 00:02:00 NaN
1 14/05/2018 00:03:00 NaN
1 14/05/2018 00:04:00 NaN
1 14/05/2018 00:05:00 NaN
1 14/05/2018 00:06:00 NaN
1 14/05/2018 00:07:00 NaN
1 14/05/2018 00:08:00 NaN
1 15/05/2018 00:01:00 1
1 15/05/2018 00:02:00 1
1 16/05/2018 00:01:00 1
1 16/05/2018 00:02:00 1
2 10/07/2018 00:03:00 NaN
2 10/07/2018 00:04:00 NaN
2 10/07/2018 00:05:00 NaN
2 10/07/2018 00:06:00 NaN
2 10/07/2018 00:07:00 NaN
2 10/07/2018 00:08:00 NaN
2 11/07/2018 00:01:00 1
2 11/07/2018 00:02:00 1
2 12/07/2018 00:01:00 1
2 12/07/2018 00:02:00 1
For each ID, I want to replace NaN with Flag=1 on only the 4 rows immediately above the first row that has Flag=1 (that is, the first time of the first flagged day).
Expected csv:
1 14/05/2018 00:01:00 NaN
1 14/05/2018 00:02:00 NaN
1 14/05/2018 00:03:00 NaN
1 14/05/2018 00:04:00 NaN
1 14/05/2018 00:05:00 1
1 14/05/2018 00:06:00 1
1 14/05/2018 00:07:00 1
1 14/05/2018 00:08:00 1
1 15/05/2018 00:01:00 1
1 15/05/2018 00:02:00 1
1 16/05/2018 00:01:00 1
1 16/05/2018 00:02:00 1
2 10/07/2018 00:03:00 NaN
2 10/07/2018 00:04:00 NaN
2 10/07/2018 00:05:00 1
2 10/07/2018 00:06:00 1
2 10/07/2018 00:07:00 1
2 10/07/2018 00:08:00 1
2 11/07/2018 00:01:00 1
2 11/07/2018 00:02:00 1
2 12/07/2018 00:01:00 1
2 12/07/2018 00:02:00 1
How can I do that? Thanks.
Since you're changing all Flag values to 1:
import pandas as pd
df = pd.read_csv('path/to/csv.csv')
df['Flag'] = 1
df.to_csv('path/to/csv.csv', index=False)
If, however, you don't want to update all Flag values, check out either loc or iloc for accessing specific parts of your DataFrame.
You need to combine a few different commands. To find the first row for each ID, use pandas groupby on multiple columns, ID and Date, like this:
df = pd.read_csv(input_file)
filtered_df = df.groupby(['ID', 'Date'])
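Note that groupby returns a GroupBy object rather than a DataFrame. Call an aggregation such as .first() on it to get the first row of each (ID, Date) group, then use those Date and Time values to select the rows to update in the original dataframe.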
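One possible way to do the whole update without an explicit loop is sketched below. It assumes the rows are already sorted by ID, Date and Time, and that "above" means the preceding rows within the same ID. The function name backfill_flag and the n_rows parameter are hypothetical, and the sample frame only keeps the ID and Flag columns from the question.

```python
import numpy as np
import pandas as pd


def backfill_flag(df: pd.DataFrame, n_rows: int = 4) -> pd.DataFrame:
    """Set Flag=1 on the n_rows rows preceding the first flagged row of each ID."""
    out = df.copy()
    flagged = out["Flag"].eq(1)
    # Position of each row inside its ID group: 0, 1, 2, ...
    pos = out.groupby("ID").cumcount()
    # Position of the first flagged row per ID, broadcast to every row of that ID.
    # IDs without any flagged row get NaN, so the comparisons below are False.
    first_pos = pos.where(flagged).groupby(out["ID"]).transform("min")
    mask = (pos >= first_pos - n_rows) & (pos < first_pos)
    out.loc[mask, "Flag"] = 1
    return out


# Same shape as the question: ID 1 has 8 unflagged rows, ID 2 has 6.
df = pd.DataFrame({
    "ID": [1] * 12 + [2] * 10,
    "Flag": [np.nan] * 8 + [1] * 4 + [np.nan] * 6 + [1] * 4,
})
result = backfill_flag(df)
print(result)
```

If an ID has fewer than n_rows rows before its first flag, this simply flags the rows that exist.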
I am trying to develop a more efficient loop to complete a problem. At the moment, the code below applies a string if it aligns with a specific value. However, the values are in identical order so a loop could make this process more efficient.
Using the df below as an example, using integers to represent time periods, each integer increase equates to a 15 min period. So 1 == 8:00:00 and 2 == 8:15:00 etc. At the moment I would repeat this process until the last time period. If this gets up to 80 it could become very inefficient. Could a loop be incorporated here?
import pandas as pd
d = ({
'Time' : [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6],
})
df = pd.DataFrame(data = d)
def time_period(row):
    if row['Time'] == 1:
        return '8:00:00'
    if row['Time'] == 2:
        return '8:15:00'
    if row['Time'] == 3:
        return '8:30:00'
    if row['Time'] == 4:
        return '8:45:00'
    if row['Time'] == 5:
        return '9:00:00'
    if row['Time'] == 6:
        return '9:15:00'
    .....
    if row['Time'] == 80:
        return '4:00:00'
df['24Hr Time'] = df.apply(lambda row: time_period(row), axis=1)
print(df)
Out:
Time 24Hr Time
0 1 8:00:00
1 1 8:00:00
2 1 8:00:00
3 2 8:15:00
4 2 8:15:00
5 2 8:15:00
6 3 8:30:00
7 3 8:30:00
8 3 8:30:00
9 4 8:45:00
10 4 8:45:00
11 4 8:45:00
12 5 9:00:00
13 5 9:00:00
14 5 9:00:00
15 6 9:15:00
16 6 9:15:00
17 6 9:15:00
This is possible with some simple timedelta arithmetic:
df['24Hr Time'] = (
    pd.to_timedelta((df['Time'] - 1) * 15, unit='m') + pd.Timedelta(hours=8))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time timedelta64[ns]
dtype: object
If you need a string, use pd.to_datetime with unit and origin:
df['24Hr Time'] = (
    pd.to_datetime((df['Time'] - 1) * 15, unit='m', origin='8:00:00')
      .dt.strftime('%H:%M:%S'))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time object
dtype: object
In general, you would build a dictionary and apply it with map:
my_dict = {'old_val1': 'new_val1',...}
df['24Hr Time'] = df['Time'].map(my_dict)
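The dictionary does not have to be typed out by hand. As a rough sketch, assuming period 1 starts at 8:00 and each period lasts 15 minutes as in the question, it can be generated with a comprehension. The names start, step, fmt and time_labels are made up for this example.

```python
import pandas as pd

# Hypothetical sample with a few of the period numbers from the question.
df = pd.DataFrame({"Time": [1, 1, 2, 3, 6]})

start = pd.Timedelta(hours=8)     # period 1 begins at 8:00
step = pd.Timedelta(minutes=15)   # each period lasts 15 minutes


def fmt(td: pd.Timedelta) -> str:
    """Format a timedelta as H:MM:SS, wrapping around after 24 hours."""
    total_minutes = int(td.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours % 24}:{minutes:02d}:00"


# Build the lookup only for the periods that can occur in the data.
last_period = int(df["Time"].max())
time_labels = {p: fmt(start + (p - 1) * step) for p in range(1, last_period + 1)}

df["24Hr Time"] = df["Time"].map(time_labels)
print(df)
```

This keeps the result as strings in the same format as the original if-chain.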
But in this case, you can do it with timedelta arithmetic:
df['24Hr Time'] = pd.to_timedelta(df['Time']*15, unit='min') + pd.to_timedelta('7:45:00')
Output (note that the new column is of type timedelta, not string)
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
I ended up using this:
pd.to_datetime((df.Time-1)*15*60+8*60*60,unit='s').dt.time
0 08:00:00
1 08:00:00
2 08:00:00
3 08:15:00
4 08:15:00
5 08:15:00
6 08:30:00
7 08:30:00
8 08:30:00
9 08:45:00
10 08:45:00
11 08:45:00
12 09:00:00
13 09:00:00
14 09:00:00
15 09:15:00
16 09:15:00
17 09:15:00
Name: Time, dtype: object
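A fun way is to use pd.timedelta_range and Index.repeat (this assumes the rows are sorted by Time and every period from 1 up to the largest one is present):
n = df.Time.nunique()
c = df.groupby('Time').size()
df['24_hr'] = pd.timedelta_range(start='8 hours', periods=n, freq='15min').repeat(c)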
Out[380]:
Time 24_hr
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
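If the rows are not sorted by Time, a variation is to turn the same timedelta_range into a lookup keyed by period number and map it. This is just a sketch; the lookup name and the deliberately unsorted sample are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample.
df = pd.DataFrame({"Time": [3, 1, 2, 1, 3]})

n = int(df["Time"].max())
# Period number -> offset from midnight (1 -> 08:00, 2 -> 08:15, ...).
lookup = dict(
    zip(range(1, n + 1), pd.timedelta_range(start="8 hours", periods=n, freq="15min"))
)
df["24_hr"] = df["Time"].map(lookup)
print(df)
```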
I have got a python related question about datetimes in a dataframe. I imported the following df via pd.read_csv()
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
I would like to know the time difference over the rows that are labeled with A, B, C as in the following:
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0:02
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B 0:09
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C 0:02
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
So the d_time should be the total time difference over labeled rows. There are approx. 100 different labels, and they can vary from 1 to x in a row. This calculation has to be done for +1 million rows, so a loop will probably not work. Does anybody know how to do this? Thanks in advance.
Assuming consecutive labels are all the same and the labeled runs are separated by a single NaN row, you can do something like this:
idx = pd.Series(df[pd.isnull(df['label'])].index)
idx_begin = idx.iloc[:-1] + 1
idx_end = idx.iloc[1:] - 1
d_time = df.loc[idx_end, 'datetime'].reset_index(drop=True) - df.loc[idx_begin, 'datetime'].reset_index(drop=True)
d_time.index = idx_begin
df.loc[idx_begin, 'd_time'] = d_time
If your dataset looks different, you might look into different ways to get to idx_begin and idx_end, but this works for the dataset you posted
Multiple consecutive nans
If there are multiple consecutive nan-values, you can solve this by adding this to the end
df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None
Consecutive different labels
idx = df[(df['label'] != df['label'].shift(1)) & (pd.notnull(df['label']) | (pd.notnull(df['label'].shift(1))))].index
idx_begin = idx[:-1]
idx_end = idx[1:] -1
This treats each change of label as the end of one run and the start of the next. To make this work, you will need to add df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None to the end.
The & (pd.notnull(df['label']) | (pd.notnull(df['label'].shift(1)))) part is needed because NaN != NaN evaluates to True, so consecutive missing labels would otherwise each be flagged as a label change.
Result
datetime label d_time
0 2017-01-03 23:52:00 NaN NaN
1 2017-01-03 23:53:00 A NaN
2 2017-01-03 23:54:00 A NaN
3 2017-01-03 23:52:00 NaN NaN
4 2017-01-03 23:53:00 B NaN
5 2017-01-03 23:54:00 B NaN
6 2017-01-03 23:55:00 NaN NaN
7 2017-01-03 23:56:00 NaN NaN
8 2017-01-03 23:57:00 NaN NaN
9 2017-01-04 00:02:00 A NaN
10 2017-01-04 00:06:00 A NaN
11 2017-01-04 00:09:00 A NaN
12 2017-01-04 00:02:00 B NaN
13 2017-01-04 00:06:00 B NaN
14 2017-01-04 00:09:00 B NaN
15 2017-01-04 00:11:00 NaN NaN
yields
datetime label d_time
0 2017-01-03 23:52:00 NaN NaT
1 2017-01-03 23:53:00 A 00:01:00
2 2017-01-03 23:54:00 A NaT
3 2017-01-03 23:52:00 NaN NaT
4 2017-01-03 23:53:00 B 00:01:00
5 2017-01-03 23:54:00 B NaT
6 2017-01-03 23:55:00 NaN NaT
7 2017-01-03 23:56:00 NaN NaT
8 2017-01-03 23:57:00 NaN NaT
9 2017-01-04 00:02:00 A 00:07:00
10 2017-01-04 00:06:00 A NaT
11 2017-01-04 00:09:00 A NaT
12 2017-01-04 00:02:00 B 00:07:00
13 2017-01-04 00:06:00 B NaT
14 2017-01-04 00:09:00 B NaT
15 2017-01-04 00:11:00 NaN NaT
Last Series
If the last row doesn't have a changed label compared to the one before it, the last series will not register.
You can prevent this by including this after the first line
if idx[-1] != df.index[-1]:
    idx = idx.append(df.index[[-1]]+1)
If the datetimes are datetime objects (or pandas.Timestamp), you can use this for-loop:
a_rows = []
for row in df.itertuples():
    if row.label == 'A':
        a_rows.append(row)
    elif a_rows:
        d_time = a_rows[-1].datetime - a_rows[0].datetime
        df.loc[a_rows[0].Index, 'd_time'] = d_time
        a_rows = []
with this result
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0 days 00:02:00
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 A 0 days 00:07:00
6 2017-01-04 00:06:00 A
7 2017-01-04 00:09:00 A
8 2017-01-04 00:11:00
You can later format the timedelta object if you want.
If the datetime column contains strings, you can easily convert them with df['datetime'] = pd.to_datetime(df['datetime'])
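Since the question mentions more than a million rows and about 100 labels, a loop-free variant may be preferable. The sketch below numbers each run of identical consecutive labels and takes max minus min per run. It assumes the datetime column is already parsed and that unlabeled rows are NaN; the names new_run, run_id and runs are made up here.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-01-03 23:52", "2017-01-03 23:53", "2017-01-03 23:54",
        "2017-01-03 23:55", "2017-01-04 00:01", "2017-01-04 00:02",
        "2017-01-04 00:06", "2017-01-04 00:09", "2017-01-04 00:11",
        "2017-01-04 00:12", "2017-01-04 00:14", "2017-01-04 00:16",
        "2017-01-04 00:18", "2017-01-04 00:20", "2017-01-04 00:22",
    ]),
    "label": [np.nan, "A", "A", "A", np.nan, "B", "B", "B", "B",
              np.nan, np.nan, np.nan, "C", "C", np.nan],
})

labeled = df["label"].notna()
# A new run starts when the label changes; unlabeled rows never extend a run.
new_run = df["label"].ne(df["label"].shift()) | ~labeled
run_id = new_run.cumsum()

# Duration of each labeled run, computed only over the labeled rows.
runs = df.loc[labeled, "datetime"].groupby(run_id[labeled])
duration = runs.transform("max") - runs.transform("min")

# Keep the duration on the first row of each run; other rows become NaT.
df["d_time"] = duration[new_run[labeled]]
print(df)
```

Because runs are detected by label changes, this should also handle two different labels that follow each other without a NaN row in between.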
I have a dataframe named DateUnique made of all unique dates (format datetime or string) that are present in my other dataframe named A.
>>> print(A)
'dateLivraisonDemande' 'abscisse' 'BaseASDébut' 'BaseATDébut'
0 2015-05-27 2004-01-10 05:00:00 05:00:00
1 2015-05-27 2004-02-10 18:30:00 22:30:00
2 2015-05-27 2004-01-20 23:40:00 19:30:00
3 2015-05-27 2004-03-10 12:05:00 06:00:00
4 2015-05-27 2004-01-10 23:15:00 13:10:00
5 2015-05-27 2004-02-10 18:00:00 13:45:00
6 2015-05-27 2004-01-20 02:05:00 19:15:00
7 2015-05-27 2004-03-20 08:00:00 07:45:00
8 2015-05-29 2004-01-01 18:45:00 21:00:00
9 2015-05-27 2004-02-15 04:20:00 07:30:00
10 2015-04-10 2004-01-20 13:50:00 15:30:00
And:
>>> print(DateUnique)
1 1899-12-30
2 1900-01-01
3 2004-03-10
4 2004-03-20
5 2004-01-20
6 2015-05-29
7 2015-04-10
8 2015-05-27
9 2004-02-15
10 2004-02-10
How can I get the name of the columns that contain each date?
Maybe with something similar to this:
# input:
If row == '2015-04-10':
print(df.name_Of_Column([0]))
# output:
'dateLivraisonDemande'
You can make a function that returns the appropriate column. Use the vectorized isin function, and then check if any value is True.
df = pd.DataFrame({'dateLivraisonDemande': ['2015-05-27']*7 + ['2015-05-27', '2015-05-29', '2015-04-10'],
                   'abscisse': ['2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
                                '2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
                                '2004-02-15', '2004-01-20']})
DateUnique = pd.Series(['1899-12-30', '1900-01-01', '2004-03-10', '2004-03-20',
                        '2004-01-20', '2015-05-29', '2015-04-10', '2015-05-27',
                        '2004-02-15', '2004-02-10'])
def return_date_columns(date_input):
    if df["dateLivraisonDemande"].isin([date_input]).any():
        return "dateLivraisonDemande"
    if df["abscisse"].isin([date_input]).any():
        return "abscisse"
>>> DateUnique.apply(return_date_columns)
0 None
1 None
2 abscisse
3 None
4 abscisse
5 dateLivraisonDemande
6 dateLivraisonDemande
7 dateLivraisonDemande
8 abscisse
9 abscisse
dtype: object
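The function above hard-codes the two column names and returns only the first match. A more general sketch, assuming the dates are stored as strings in the same format in every column, checks all columns at once and returns every column that contains the value. The name columns_containing and the small sample frame are hypothetical.

```python
import pandas as pd

# Hypothetical small sample; '2015-05-27' appears in both columns on purpose.
df = pd.DataFrame({
    "dateLivraisonDemande": ["2015-05-27", "2015-05-29", "2015-04-10"],
    "abscisse": ["2004-01-10", "2004-02-10", "2015-05-27"],
})
DateUnique = pd.Series(["1899-12-30", "2015-05-27", "2004-02-10"])


def columns_containing(value, frame: pd.DataFrame) -> list:
    """Return the names of all columns in `frame` that contain `value`."""
    hits = frame.eq(value).any()   # one boolean per column
    return hits[hits].index.tolist()


result = {d: columns_containing(d, df) for d in DateUnique}
print(result)
```

If the columns hold real datetimes rather than strings, the values in DateUnique would need to be converted with pd.to_datetime first so that the comparison is like for like.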
I have a csv-file with entries like this:
1,2014 1 1 0 1,5
2,2014 1 1 0 1,5
3,2014 1 1 0 1,5
4,2014 1 1 0 1,6
5,2014 1 1 0 1,6
6,2014 1 1 0 1,12
7,2014 1 1 0 1,17
8,2014 5 7 1 5,4
The first column is the ID, the second is the arrival date (for example, the last entry is May 7, 1:05 a.m.) and the last column is the duration of work (in minutes).
Currently, I read in the data using pandas and the following function:
import pandas as pd
def convert_data(csv_path):
    store = pd.HDFStore(data_file)
    print('Loading CSV File')
    df = pd.read_csv(csv_path, parse_dates=True)
    print('CSV File Loaded, Converting Dates/Times')
    df['Arrival_time'] = map(convert_time, df['Arrival_time'])
    df['Rel_time'] = (df['Arrival_time'] - REF.timestamp)/60.0
    print('Conversion Complete')
    store['orders'] = df
My question is: How can I sort the entries according to their duration, but considering the arrival-date? So, I'd like to sort the csv-entries according to "arrival-date + duration". How is this possible?
Thanks for any hint! Best regards, Stan.
OK, the following shows how you can convert the datetimes and then how to add the minutes:
In [79]:
df['Arrival_Date'] = pd.to_datetime(df['Arrival_Date'], format='%Y %m %d %H %M')
df
Out[79]:
ID Arrival_Date Duration
0 1 2014-01-01 00:01:00 5
1 2 2014-01-01 00:01:00 5
2 3 2014-01-01 00:01:00 5
3 4 2014-01-01 00:01:00 6
4 5 2014-01-01 00:01:00 6
5 6 2014-01-01 00:01:00 12
6 7 2014-01-01 00:01:00 17
7 8 2014-05-07 01:05:00 4
In [80]:
import datetime as dt
df['Arrival_and_Duration'] = df['Arrival_Date'] + df['Duration'].apply(lambda x: dt.timedelta(minutes=int(x)))
df
Out[80]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
In [81]:
df.sort_values(by=['Arrival_and_Duration'])
Out[81]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
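For reference, here is a self-contained sketch of the same steps that can be run end to end. It uses pd.to_timedelta instead of apply with datetime.timedelta, which should give the same result in a vectorized way. The column names and the CSV rows are made up (the rows are deliberately out of order so the sort has a visible effect).

```python
import io

import pandas as pd

# Hypothetical rows in the same layout as the question: ID, arrival, duration.
csv_text = """1,2014 1 1 0 1,17
2,2014 1 1 0 1,5
3,2014 5 7 1 5,4
4,2014 1 1 0 3,6
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["ID", "Arrival_Date", "Duration"],
)

# Parse "year month day hour minute" into real timestamps.
df["Arrival_Date"] = pd.to_datetime(df["Arrival_Date"], format="%Y %m %d %H %M")

# Add the duration (in minutes) to the arrival time.
df["Arrival_and_Duration"] = df["Arrival_Date"] + pd.to_timedelta(
    df["Duration"], unit="min"
)

# Sort by completion time; sort_values returns a new DataFrame.
df_sorted = df.sort_values("Arrival_and_Duration")
print(df_sorted)
```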